Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

From testing of risk parameter #333

Open
Kyoshido opened this issue Nov 23, 2022 · 0 comments
Open

From testing of risk parameter #333

Kyoshido opened this issue Nov 23, 2022 · 0 comments

Comments

@Kyoshido
Copy link
Member

Hello,
I was testing funkcionality of risk and risk_threshold parameter.
In recordSwap() was parameter swaprate set to 0, because I wanted to test swapping only of risky values.
But when checking which households were swapped, there was one non-risky household that was swapped.
This issue is for somebody with enough time to dig up why it happened.

Data used:

seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- sdcMicro::createDat( nhid )

k_anonymity <- 0
swaprate <- .05
similar <- list(c("hsize"))
hier <- c("nuts1","nuts2","nuts3")
risk_variables <- c("ageGroup","national")
hid <- "hid"
expect_swaps <- swaprate * nhid
risk <- matrix(data = rep(0,3*dim(dat)[1]), ncol = 3)
set.seed(seed)
index <- sample(1:(length(hier)*nrow(dat)), size = expect_swaps)
risk[index] <- 99 # risk value
risk_threshold <- 0.9

Then run recordSwap() function:

dat_s <- recordSwap(
data = dat, hid = hid, hierarchy = hier,
similar,
risk = risk, # TESTING
risk_threshold = risk_threshold, # TESTING
carry_along = NULL,
return_swapped_id = TRUE,
seed=seed,
swaprate = 0 # to test swaping based only on risk values
)
risk does not contain column names; the first column in risk will be used for the first hierarchy level, e.g nuts1 and so on.
risk was adjusted in order to give each household member the maximum household risk value
Donor household was not found in 16 case(s).
See TRS_logfile.txt for a detailed list

Following calculation of households that should be swapped

risk_check <- copy(data.table(risk))
risk_variables_names <- copy(colnames(risk_check))
risk_check[,hid_help:=dat[[5]]] # 5 = hid
risk_check[,c(risk_variables_names):=lapply(.SD,max), .SDcols=c(risk_variables_names), by=.(hid_help)] # assign to each household its max value
risk_check_uniq <- unique(risk_check) # calculate max value in household
risk_check_uniq[, Sum := rowSums(.SD), .SDcols = 1:3]
n_of_hd_to_swap <- sum(risk_check_uniq$Sum > 1) # number of households that should be swapped
n_of_swapped_hd <- dat_s[hid != hid_swapped, uniqueN(hid)] # number of swapped households

n_of_hd_to_swap
[1] 489
n_of_swapped_hd
[1] 474

So what is difference in households

hd_to_swap <- risk_check_uniq$hid_help[risk_check_uniq$Sum > 1
] # number of households that was should be swapped
swapped_hd <- dat_s[hid != hid_swapped, unique(hid)
] # number of swapped households

Here are those 16 hd that could not find the donor household

hd_to_swap[which(!(hd_to_swap %in% swapped_hd))] # hd to swapp not in swapped hd
[1] 1540 2823 3212 3588 3876 3996 4841 5717 5962 6885 7061 8621 9101 9168 9206 9257

Here is one non-risky household that was swapped. And the question is why?

swapped_hd[which(!(swapped_hd %in% hd_to_swap))] # swapped hd not in hd to swap
[1] 10000

Here are data for this household

dat_s[which(dat_s$hid==10000),]

nuts1 nuts2 nuts3 lau2 hid hsize ageGroup gender national htype hincome hid_swapped
3 34 3405 24075 10000 3 6 2 4 10 2 5064
3 34 3405 24075 10000 3 1 2 1 10 2 5064
3 34 3405 24075 10000 3 4 2 3 10 2 5064

Here are risk values for this household

risk_check[which(dat_s$hid==10000),]

V1 V2 V3 hid_help
0 0 0 10000
0 0 0 10000
0 0 0 10000

On recomendation of @JohannesGuss I created this issue.
Whole script can be found in attachment.
script_issue.docx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant