ccao-data · dfsnow · Jan 19, 2024 · Jan 18, 2024
@@ -20,6 +20,7 @@ labels: release
 - [ ] Using the command line, grab the final compressed database file from the temporary directory (found at `db_path` after running `data-raw/create_db.R`) and move it to the project directory. Rename the file `ptaxsim-<TAX_YEAR>.<MAJOR VERSION>.<MINOR VERSION>.db.bz2`
 - [ ] Decompress the database file for local testing using `pbzip2`. The typical command will be something like `pbzip2 -d -k ptaxsim-2021.0.2.db.bz2`
 - [ ] Rename the decompressed local database file to `ptaxsim.db` for local testing. This is the file name that the unit tests and vignettes expect
+- [ ] Use [sqldiff](https://www.sqlite.org/sqldiff.html) or a similar tool to compare the new database file to the previous version. Ensure that the changes are expected
 - [ ] Restart R. Then run the unit tests (`devtools::test()` in the console) and vignettes (`pkgdown::build_site()` in the console) locally
 - [ ] Knit the `README.Rmd` file to update the database link at the top of the README. The link is pulled from the `ptaxsim.db` file's `metadata` table
 - [ ] If necessary, update the database diagrams in the README with any new fields or tables

@@ -22,7 +22,7 @@ Imports:
     glue,
     RSQLite,
     utils
-RoxygenNote: 7.2.3
+RoxygenNote: 7.3.0
 Suggests:
     arrow,
     covr,
@@ -36,8 +36,10 @@ Suggests:
     httr,
     knitr,
     lintr,
+    noctua,
     odbc,
     openxlsx,
+    pdftools,
     pkgdown,
     prettymapr,
     purrr,
@@ -60,4 +62,4 @@ Remotes:
     paleolimbot/geoarrow,
     ropensci/tabulizer
 Config/Requires_DB_Version: 2021.0.4
-Config/Wants_DB_Version: 2021.0.4
+Config/Wants_DB_Version: 2022.0.0
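
Review note: the version fields above are three-part strings, so they can be compared component by component. The snippet below is a minimal sketch of such a comparison against `Config/Requires_DB_Version`. It is not the package's actual `check_db_version()` implementation; `db_satisfies()` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: TRUE when the database version is at least the
# required version. package_version() compares the dot-separated
# components numerically, from left to right.
db_satisfies <- function(db_version, required_version) {
  package_version(db_version) >= package_version(required_version)
}

# The new database version against the unchanged requirement
db_satisfies("2022.0.0", "2021.0.4")

# An older database would fail the same check
db_satisfies("2021.0.3", "2021.0.4")
```

Comparing parsed versions rather than raw strings avoids lexical surprises such as `"2021.0.10"` sorting before `"2021.0.4"`.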
@@ -193,7 +193,7 @@ tax_bill <- function(year_vec,
 
   # Calculate the exemption effect by subtracting the exempt amount from
   # the total taxable EAV
-  dt[, agency_tax_rate := agency_total_ext / agency_total_eav]
+  dt[, agency_tax_rate := agency_total_ext / as.numeric(agency_total_eav)]
   dt[, tax_amt_exe := exe_total * agency_tax_rate]
   dt[, tax_amt_pre_exe := round(eav * agency_tax_rate, 2)]
   dt[, tax_amt_post_exe := round(tax_amt_pre_exe - tax_amt_exe, 2)]
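
Review note: the four assignments in this hunk can be traced by hand. The sketch below repeats the same arithmetic on plain numeric vectors with made-up inputs; none of the numbers come from the PTAXSIM database. The `as.numeric()` cast mirrors the changed line and is presumably there because the EAV column is stored as a 64-bit integer, which is an assumption on my part rather than something the diff states.

```r
# Made-up inputs for a single PIN in a single taxing district
agency_total_ext <- 5000000    # total extension (levy) for the agency
agency_total_eav <- 100000000  # total taxable EAV for the agency
eav <- 50000                   # the PIN's EAV
exe_total <- 10000             # the PIN's total exempt EAV

# Same steps as the diff, on ordinary doubles
agency_tax_rate <- agency_total_ext / as.numeric(agency_total_eav)
tax_amt_exe <- exe_total * agency_tax_rate
tax_amt_pre_exe <- round(eav * agency_tax_rate, 2)
tax_amt_post_exe <- round(tax_amt_pre_exe - tax_amt_exe, 2)

print(c(
  rate = agency_tax_rate,
  exe = tax_amt_exe,
  pre = tax_amt_pre_exe,
  post = tax_amt_post_exe
))
```

With these inputs the rate is 0.05, so the exemption is worth 500 and the bill drops from 2500 to 2000.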

@@ -555,9 +555,9 @@ erDiagram
 
 ## Notes and caveats
 
-- Currently, the per-district tax calculations for properties in the Red-Purple Modernization (RPM) TIF are slightly flawed. However, the total tax bill per PIN is still accurate. See issue [#11](#11) for more information.
-- Special Service Area (SSA) rates must be calculated manually when creating counterfactual bills. See issue [#31](#31) for more information.
-- In rare instances, a TIF can have multiple `agency_num` identifiers (usually there's only one per TIF). The `tif_crosswalk` table determines what the "main" `agency_num` is for each TIF and pulls the name and TIF information using that identifier. See issue [#39](#39) for more information.
+- Currently, the per-district tax calculations for properties in the Red-Purple Modernization (RPM) TIF are slightly flawed. However, the total tax bill per PIN is still accurate. See issue [#4](https://github.com/ccao-data/ptaxsim/issues/4) for more information.
+- Special Service Area (SSA) rates must be calculated manually when creating counterfactual bills. See issue [#3](https://github.com/ccao-data/ptaxsim/issues/3) for more information.
+- In rare instances, a TIF can have multiple `agency_num` identifiers (usually there's only one per TIF). The `tif_crosswalk` table determines what the "main" `agency_num` is for each TIF and pulls the name and TIF information using that identifier. See issue [GitLab #39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39) for more information.
 - PTAXSIM is relatively memory-efficient and can calculate every district line-item for every tax bill for the last 15 years (roughly 350 million rows). However, the memory required for this calculation is substantial (around 100 GB).
 - PTAXSIM's accuracy is measured automatically with an [integration test](tests/testthat/test-accuracy.R). The test takes a random sample of 1 million PINs, calculates the total bill for each PIN, and compares it to the real total bill.
 - This repository contains an edited version of PTAXSIM's commit history. Historical Git LFS and other data files (.csv, .xlsx, etc.) were removed in the transition to GitHub. The most current version of these files is available starting in commit [1f06639](https://github.com/ccao-data/ptaxsim/commit/1f06639d98a720999222579b7ff61bcce061f1ec). If you need the historical LFS files for any reason, please visit the [GitLab archive](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim) of this repository.

@@ -37,13 +37,13 @@ Table of Contents
 > installation](#database-installation) for details.
 >
 > [**Link to PTAXSIM
-> database**](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2021.0.4.db.bz2)
-> (DB version: 2021.0.4; Last updated: 2023-04-28 23:40:05)
+> database**](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2022.0.0.db.bz2)
+> (DB version: 2022.0.0; Last updated: 2024-01-19 04:40:35)
 
 PTAXSIM is an R package/database to approximate Cook County property tax
 bills. It uses real assessment, exemption, TIF, and levy data to
 generate historic, line-item tax bills (broken out by taxing district)
-for any property from 2006 to 2021. Given some careful assumptions and
+for any property from 2006 to 2022. Given some careful assumptions and
 data manipulation, it can also provide hypothetical, but factually
 grounded, answers to questions such as:
 
@@ -173,9 +173,9 @@ database:
 
 1.  Download the compressed database file from the CCAO’s public S3
     bucket. [Link
-    here](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2021.0.4.db.bz2).
+    here](https://ccao-data-public-us-east-1.s3.amazonaws.com/ptaxsim/ptaxsim-2022.0.0.db.bz2).
 2.  (Optional) Rename the downloaded database file by removing the
-    version number, i.e. ptaxsim-2021.0.4.db.bz2 becomes
+    version number, i.e. ptaxsim-2022.0.0.db.bz2 becomes
     `ptaxsim.db.bz2`.
 3.  Decompress the downloaded database file. The file is compressed
     using [bzip2](https://sourceware.org/bzip2/).
@@ -863,15 +863,18 @@ erDiagram
 
 - Currently, the per-district tax calculations for properties in the
   Red-Purple Modernization (RPM) TIF are slightly flawed. However, the
-  total tax bill per PIN is still accurate. See issue [\#4](https://github.com/ccao-data/ptaxsim/issues/4) for
-  more information.
+  total tax bill per PIN is still accurate. See issue
+  [\#4](https://github.com/ccao-data/ptaxsim/issues/4) for more
+  information.
 - Special Service Area (SSA) rates must be calculated manually when
-  creating counterfactual bills. See issue [\#3](https://github.com/ccao-data/ptaxsim/issues/3) for more
+  creating counterfactual bills. See issue
+  [\#3](https://github.com/ccao-data/ptaxsim/issues/3) for more
   information.
 - In rare instances, a TIF can have multiple `agency_num` identifiers
   (usually there’s only one per TIF). The `tif_crosswalk` table
   determines what the “main” `agency_num` is for each TIF and pulls the
-  name and TIF information using that identifier. See archived issue [\#39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39)
+  name and TIF information using that identifier. See issue [GitLab
+  \#39](https://gitlab.com/ccao-data-science---modeling/packages/ptaxsim/-/issues/39)
   for more information.
 - PTAXSIM is relatively memory-efficient and can calculate every
   district line-item for every tax bill for the last 15 years (roughly

@@ -45,7 +45,6 @@ file_names <- list.files(
 
 
 
-
 # agency_fund ------------------------------------------------------------------
 
 # Load the detail sheet from each agency file. This includes the levy and rate
@@ -64,7 +63,7 @@ agency_fund <- map_dfr(file_names, function(file) {
       "loss", "loss_percent", "fund_loss"
     ))) %>%
     rename_with(~"levy_plus_loss", any_of(c(
-      "levy_and_loss", "fund_levy_plus_loss"
+      "levy_and_loss", "fund_levy_plus_loss", "levy_loss"
     ))) %>%
     rename_with(~"rate_ceiling", any_of(c(
       "ceiling", "rate_ceiling", "fund_rate_ceiling"
@@ -189,7 +188,7 @@ arrow::write_parquet(
 # EAV, final extension, and much more
 agency <- map_dfr(file_names, function(file) {
   message("Reading: ", file)
-  readxl::read_xlsx(file) %>%
+  readxl::read_xlsx(file, sheet = 1) %>%
     set_names(snakecase::to_snake_case(names(.))) %>%
     mutate(
       across(
@@ -235,9 +234,12 @@ agency <- map_dfr(file_names, function(file) {
       "reduction_percent", "reduction_factor", "clerk_reduction_factor"
     ))) %>%
     rename_with(~"total_non_cap_ext", any_of(c(
-      "total_non_cap_ext", "final_non_cap_ext"
+      "total_non_cap_ext", "final_non_cap_ext", "total_non_cap_extension"
+    ))) %>%
+    rename_with(~"total_ext", any_of(c(
+      "total_ext", "final_ext",
+      "grand_total_ext"
     ))) %>%
-    rename_with(~"total_ext", any_of(c("total_ext", "final_ext"))) %>%
     # Select, order, and rename columns
     select(
       year,
@@ -281,7 +283,7 @@ agency <- map_dfr(file_names, function(file) {
       0,
       cty_cook_eav
     ),
-    across(starts_with("cty_"), replace_na, 0),
+    across(starts_with("cty_"), ~ replace_na(.x, 0)),
     # Make all percentages decimals
     across(
       c(pct_burden, reduction_pct),
@@ -296,20 +298,20 @@ agency <- map_dfr(file_names, function(file) {
   arrange(year, agency_num) %>%
   # Coerce columns to expected types
   mutate(
-    across(c(year), as.character),
+    across(c(year), ~ as.character(.x)),
     across(
       c(
         lim_numerator, lim_denominator, prior_eav:cty_total_eav,
         total_levy, total_max_levy, total_reduced_levy, total_final_levy
       ),
-      as.integer64
+      ~ as.integer64(.x)
     ),
     across(
       c(
         lim_rate, pct_burden, total_prelim_rate, total_final_rate,
         reduction_pct, total_non_cap_ext, total_ext
       ),
-      as.double
+      ~ as.double(.x)
     )
   )
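
Review note: the `rename_with(~"total_ext", any_of(c(...)))` calls earlier in this file map whichever header alias a given year's spreadsheet uses onto one canonical column name. The sketch below shows the same idea in base R so it runs without dplyr. `rename_aliases()` and both sample data frames are hypothetical, and the error on multiple matches is my own choice, not behavior taken from the script.

```r
# Hypothetical helper: rename whichever alias is present to `target`.
# Columns that match no alias are left untouched.
rename_aliases <- function(df, target, aliases) {
  hit <- names(df) %in% aliases
  if (sum(hit) > 1) {
    stop("More than one alias for '", target, "' found in the data")
  }
  names(df)[hit] <- target
  df
}

# Made-up files from two years that label the same field differently
file_a <- data.frame(agency_num = "010010000", final_ext = 1.5)
file_b <- data.frame(agency_num = "010010000", grand_total_ext = 2.5)

aliases <- c("total_ext", "final_ext", "grand_total_ext")
file_a <- rename_aliases(file_a, "total_ext", aliases)
file_b <- rename_aliases(file_b, "total_ext", aliases)

# Both files now share a schema and can be row-bound
combined <- rbind(file_a, file_b)
print(combined)
```

Adding a new alias, as this diff does for `"levy_loss"` and `"total_non_cap_extension"`, then only requires extending the alias vector.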
 

@@ -1,7 +1,7 @@
 library(arrow)
 library(dplyr)
 library(miniUI)
-library(tabulizer)
+library(pdftools)
 library(tidyr)
 library(stringr)
 
@@ -14,27 +14,34 @@ row_to_names <- function(df) {
 # The goal of this script is to create a data frame of Consumer Price Indices
 # CPI-U used by PTELL to calculate/cap property tax extensions
 # We can load the historical CPIs from a PDF provided by the State of Illinois
+# https://tax.illinois.gov/content/dam/soi/en/web/tax/localgovernments/property/documents/cpihistory.pdf # nolint
 
 # Paths for local raw data storage and remote storage on S3
 remote_bucket <- Sys.getenv("S3_REMOTE_BUCKET")
 remote_path <- file.path(remote_bucket, "cpi", "part-0.parquet")
 
-# Extract the table only (no headers), then manually assign header
-cpi_ext <- extract_areas(file = "data-raw/cpi/cpihistory.pdf")[[1]]
-cpi <- as_tibble(cpi_ext[, c(1, 2, 4, 5, 6)])
-cpi <- setNames(cpi, c("year", "cpi", "ptell_cook", "comments", "levy_year"))
+cpi <- pdftools::pdf_text(pdf = "data-raw/cpi/cpihistory.pdf") %>%
+  str_extract(., regex("1991.*", dotall = TRUE)) %>%
+  str_remove_all(., "\\(5 % for Cook\\)") %>%
+  str_split(., "\n") %>%
+  unlist() %>%
+  tibble(vals = `.`) %>%
+  mutate(vals = str_squish(vals)) %>%
+  separate_wider_delim(
+    col = vals,
+    names = c("year", "cpi", "pct", "ptell_cook", "levy_year", "year_paid"),
+    delim = " ", too_few = "align_start", too_many = "drop"
+  )
 
-# Merge Cook rate into main column
 cpi <- cpi %>%
   mutate(
     across(c(year, levy_year), as.character),
     across(c(cpi), as.numeric),
-    across(c(ptell_cook, comments), readr::parse_number),
-    ptell_cook = ifelse(!is.na(comments), comments, ptell_cook),
+    across(c(ptell_cook), readr::parse_number),
     ptell_cook = ptell_cook / 100
   ) %>%
-  select(-comments) %>%
-  filter(year != "1991") %>%
+  filter(year != "1991", year != "", year != "CPI") %>%
+  select(-pct, -year_paid) %>%
   arrange(year)
 
 # Write to S3
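
Review note: the new `pdftools` pipeline treats the PDF as plain text, cuts everything before the first data row, strips the Cook-specific annotation, and splits each line on single spaces. The sketch below reproduces those steps in base R on a made-up string so it can be run without the PDF, `pdftools`, `stringr`, or `tidyr`. The sample text and every value in it are invented for illustration, and the real `pdf_text()` output may be laid out differently.

```r
# Invented stand-in for the text of the CPI history page
page <- paste(
  "CPI History",
  "Year   CPI    Pct   PTELL  LevyYear  YearPaid",
  "1991   137.9",
  "1992   141.9  2.9   2.9%   1993      1994",
  "1993   145.8  2.7   2.7% (5 % for Cook)  1994  1995",
  sep = "\n"
)

# Keep everything from the first "1991" onward, then drop the annotation
body <- regmatches(page, regexpr("1991[\\s\\S]*", page, perl = TRUE))
body <- gsub("(5 % for Cook)", "", body, fixed = TRUE)

# One element per line, with runs of whitespace collapsed
lines <- strsplit(body, "\n", fixed = TRUE)[[1]]
lines <- trimws(gsub("\\s+", " ", lines))

# Split on spaces; pad short rows with NA and truncate long ones
cols <- c("year", "cpi", "pct", "ptell_cook", "levy_year", "year_paid")
fields <- lapply(strsplit(lines, " ", fixed = TRUE), function(x) {
  length(x) <- length(cols)
  x
})
cpi <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(cpi) <- cols

# Drop the base year and unparsed rows, then coerce types
cpi <- cpi[!is.na(cpi$year) & cpi$year != "1991", ]
cpi$cpi <- as.numeric(cpi$cpi)
cpi$ptell_cook <- as.numeric(sub("%", "", cpi$ptell_cook, fixed = TRUE)) / 100
cpi <- cpi[, c("year", "cpi", "ptell_cook", "levy_year")]
print(cpi)
```

Because the split is purely positional, any row whose text contains an extra token would shift its columns, so it may be worth asserting that `year` and `levy_year` are four-digit strings before writing to S3.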

@@ -37,7 +37,7 @@ db_send_queries <- function(conn, sql) {
 # changes. This is checked against Config/Requires_DB_Version in the DESCRIPTION
 # file via check_db_version(). Schema is:
 # "MAX_YEAR_OF_DATA.MAJOR_VERSION.MINOR_VERSION"
-db_version <- "2021.0.4"
+db_version <- "2022.0.0"
 
 # Set the package version required to use this database. This is checked against
 # Version in the DESCRIPTION file. Basically, we have a two-way check so that

@@ -2,6 +2,7 @@ library(arrow)
 library(DBI)
 library(dplyr)
 library(geoarrow)
+library(noctua)
 library(odbc)
 library(sf)
 library(tidyr)
@@ -27,6 +28,10 @@ ccaodata <- dbConnect(
   .connection_string = Sys.getenv("DB_CONFIG_CCAODATA")
 )
 
+# Establish a connection to the Data Department's Athena data warehouse. We'll
+# use values from here to fill in any missing values from the legacy system
+ccaoathena <- dbConnect(noctua::athena())
+
 # Pull AV and class from the Clerk and HEAD tables, giving preference to values
 # from the Clerk table in case of mismatch (except for property class).
 # These tables are pulled from the AS/400 and will be pulled from iasWorld
@@ -82,13 +87,38 @@ pin <- dbGetQuery(
     tax_bill_total = tidyr::replace_na(tax_bill_total, 0)
   )
 
+# Pull AVs from Athena to fill in any missingness from the legacy system
+pin_athena <- dbGetQuery(
+  ccaoathena,
+  "
+  SELECT DISTINCT
+      pin,
+      year,
+      mailed_tot,
+      certified_tot,
+      board_tot
+  FROM default.vw_pin_value
+  WHERE year >= '2006'
+  "
+) %>%
+  mutate(
+    across(c(year, pin), as.character),
+    across(c(ends_with("_tot")), as.integer)
+  )
+
 pin_fill <- pin %>%
   # There are a few (less than 100) rows with Clerk AVs split for the same PIN.
   # Sum to get 1 record per PIN, then keep the record with the highest AV
   group_by(year, pin) %>%
   mutate(av_clerk = sum(av_clerk)) %>%
   ungroup() %>%
   distinct(year, pin, .keep_all = TRUE) %>%
+  left_join(pin_athena, by = c("year", "pin")) %>%
+  mutate(
+    av_board = ifelse(is.na(av_board), board_tot, av_board),
+    av_certified = ifelse(is.na(av_certified), certified_tot, av_certified),
+    av_mailed = ifelse(is.na(av_mailed), mailed_tot, av_mailed)
+  ) %>%
   # A few (less than 500) values are missing from the mailed assessment stage
   # AV column. We can replace any missing mailed value with certified value
   # from the same year. Only 2 board/certified values are missing, and both are
@@ -97,7 +127,8 @@ pin_fill <- pin %>%
     av_board = ifelse(is.na(av_board), 0L, av_board),
     av_certified = ifelse(is.na(av_certified), 0L, av_certified),
     av_mailed = ifelse(is.na(av_mailed), av_certified, av_mailed)
-  )
+  ) %>%
+  select(-ends_with("_tot"))
 
 # Write to S3
 arrow::write_dataset(
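
Review note: the `pin_fill` change fills gaps in two passes. Missing legacy AVs are first taken from the Athena values joined on `year` and `pin`, and anything still missing falls back to the existing rules (zero for board and certified, the certified value for mailed). The sketch below walks through that order in base R on made-up rows; the `legacy` and `warehouse` data frames and their values are hypothetical.

```r
# Made-up legacy rows with gaps
legacy <- data.frame(
  year = c("2021", "2021", "2022"),
  pin = c("A", "B", "A"),
  av_mailed = c(NA, NA, 95L),
  av_certified = c(100L, NA, NA),
  av_board = c(100L, NA, NA)
)

# Made-up warehouse rows; PIN "B" is absent here as well
warehouse <- data.frame(
  year = c("2021", "2022"),
  pin = c("A", "A"),
  mailed_tot = c(98L, 95L),
  certified_tot = c(100L, 90L),
  board_tot = c(100L, 88L)
)

# Left join: keep every legacy row, matched or not
filled <- merge(legacy, warehouse, by = c("year", "pin"), all.x = TRUE)

# Pass 1: prefer the legacy value, fall back to the warehouse value
filled$av_board <- ifelse(
  is.na(filled$av_board), filled$board_tot, filled$av_board
)
filled$av_certified <- ifelse(
  is.na(filled$av_certified), filled$certified_tot, filled$av_certified
)
filled$av_mailed <- ifelse(
  is.na(filled$av_mailed), filled$mailed_tot, filled$av_mailed
)

# Pass 2: remaining gaps get the pre-existing defaults
filled$av_board <- ifelse(is.na(filled$av_board), 0L, filled$av_board)
filled$av_certified <- ifelse(
  is.na(filled$av_certified), 0L, filled$av_certified
)
filled$av_mailed <- ifelse(
  is.na(filled$av_mailed), filled$av_certified, filled$av_mailed
)

# Drop the helper columns, as select(-ends_with("_tot")) does
filled <- filled[, !grepl("_tot$", names(filled))]
print(filled)
```

The order matters: running the zero fill before the join would hide the gaps the warehouse values are meant to repair.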