Add support for custom formats, informats, and lengths #650
Comments
Just want to state that adding this support would really be game changing within pharma. There's a misconception that XPT files must be created in SAS. While SASxport can generate compliant XPTs, its performance is an issue; fixing these issues within {haven} would bring this capability into the tidyverse and make using R more accessible for regulatory submission activities. |
Hey @elimillera, thanks for the feature request! I'll have a look; none of this sounds particularly complicated. haven currently allows you to set formats using the format.sas attribute. Have you tried using this for format setting? attr(df$my_var, "format.sas") <- "DATE9" |
We are very much a {tidyverse} shop as a first stop for solutions - it would be tremendous if {haven} could close this gap for users creating compliant XPTs for regulatory submissions. |
Jumping on the bandwagon here to show support that these features would be very helpful and used by many in pharma as we transition to open-source ways of meeting FDA requirements. |
Having this functionality in {haven} would boost adoption of R in Pharma even further, as excuses such as "R cannot create submission ready datasets" become invalid. Please implement this feature! |
+1 for the need for this from the pharma industry - thanks for requesting Eli! |
Hi all, To confirm, as noted above there is a mechanism for setting formats in haven already, by setting the format.sas attribute. See the example above: attr(df$my_var, "format.sas") <- "DATE9" When writing xpt files the Having said that, I've had a play and it looks like there's a bug in ReadStat, so formats don't always write out correctly. I'll have a closer look into it, but it doesn't seem too difficult to fix. Currently the variable length is set to 8 for numeric variables and the maximum string length for character variables. |
How might one write a null vector with write_xpt that does not result in a length of zero? Our experience trying to bring that data into a SAS environment via PROC COPY has been that, well, PROC COPY does not like zero length vectors. Our workaround was to drop the null vectors before writing, but that is not ideal and is.na is not a sufficient condition for identifying all vectors that result in a zero length variable in the resulting XPT. |
Hi @MichaelRimler, do you mean writing a data frame containing character vectors for which all records are blank? Currently this would write a length of zero, since that is the max string length. I think to make SAS happy we need a min length of 1 here. To identify vectors that will have a length of zero in the meantime, you can just check the max string length. Something like this will do it:
mtcars$blank_string <- ""
lapply(mtcars, function(x) { max(nchar(x)) })
#> $mpg
#> [1] 4
#>
#> $cyl
#> [1] 1
#>
#> $disp
#> [1] 5
#>
#> $hp
#> [1] 3
#>
#> $drat
#> [1] 4
#>
#> $wt
#> [1] 5
#>
#> $qsec
#> [1] 5
#>
#> $vs
#> [1] 1
#>
#> $am
#> [1] 1
#>
#> $gear
#> [1] 1
#>
#> $carb
#> [1] 1
#>
#> $blank_string
#> [1] 0
Created on 2021-11-21 by the reprex package (v2.0.1) |
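The max-string-length check above can be wrapped in a small helper that also copes with missing values. This is only a sketch: `zero_width_columns()` is a hypothetical name, not a haven function, and it assumes that a character column containing only `NA` values ends up with the same zero length as a column of blank strings.

```r
# Hypothetical helper (not part of haven): return the names of character
# columns whose longest non-missing value has zero length.
# Assumption: all-NA character columns behave like all-blank ones on write.
zero_width_columns <- function(df) {
  is_zero_width <- vapply(df, function(x) {
    if (!is.character(x)) return(FALSE)
    # Measure only the non-missing strings, in bytes.
    widths <- nchar(x[!is.na(x)], type = "bytes")
    length(widths) == 0L || max(widths) == 0L
  }, logical(1))
  names(df)[is_zero_width]
}

df <- data.frame(
  id = 1:3,
  blank = c("", "", ""),
  missing = NA_character_,
  text = c("a", "", NA),
  stringsAsFactors = FALSE
)

# Flags the "blank" and "missing" columns, but not "id" or "text".
zero_width_columns(df)
```

Dropping `NA` values before measuring avoids `max()` returning `NA`, which is one way a plain `is.na()` or `max(nchar(x))` check can miss a column.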
@gorcha Yes, length of 1 would be good. In one of our workflows, we need to go between R and SAS and XPT is the cleanest intermediary to move from data frame to sas7bdat. But, when a data frame has such a vector, though write_xpt has no challenge, PROC COPY does not like the transfer back. Our solution (workaround) is to drop the vectors before writing, but it is not ideal. Eliminating the possibility that an XPT is created with length zero would resolve the problem as we’ve experienced it. |
Format writing fix PR in WizardMac/ReadStat#258 |
@gorcha whenever these feature updates are ready, we're happy to put together some practical testing to ensure that the Pharma needs are met. There's some compliance testing software that's used prior to submissions (called Pinnacle 21), which validates the XPT compliance along with our industry specific data standards. That combined with the handoff back and forth between R and SAS described by @MichaelRimler will verify everything we need. |
Thanks @mstackhouse! There were a couple of changes required in ReadStat to fix up XPT format writing, and it'll take a little while for these changes to flow downstream to haven. I'll let you know once I've got something ready for testing. |
@gorcha it looks like your changes in ReadStat were merged. I'm curious - are there other things that need to follow for this to flow downstream into haven? |
Hi @mstackhouse, this has been merged into the dev branch in ReadStat, but we generally wait for a ReadStat release before merging in to haven (unless it's a small/simple change) to make sure we don't diverge from the upstream code base. The releases are relatively frequent, but it depends on what else is going on in ReadStat land. |
@gorcha Thanks! That's exactly what I was interested in - what milestones haven needs from ReadStat for those changes to flow downstream. Thanks for following up! |
No worries! |
Hi @mstackhouse and @elimillera, I've pulled in the latest release candidate from ReadStat and merged in the changes on the write_xpt branch if you'd like to test it out with your compliance testing software; you can install from that branch. Both formats and informats are set using the format.sas attribute, and lengths using the width attribute. For example:
library(haven)
df <- data.frame(
  char_var = "Hello!"
)
attr(df$char_var, "format.sas") <- "$CHAR10"
attr(df$char_var, "width") <- 10
I've also updated the minimum default string length to 1, so a blank character vector will have length 1 instead of 0 by default. |
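For more than a handful of variables, the two attributes can be applied from a small specification. The sketch below is illustrative only: `apply_xpt_spec()` and the `spec` list are hypothetical names with made-up values, and the final write step assumes a haven version that includes the changes discussed in this thread.

```r
# Hypothetical helper (not part of haven): copy SAS formats and lengths from
# a named list onto the matching columns as "format.sas" and "width"
# attributes, the two attribute names used in the example above.
apply_xpt_spec <- function(df, spec) {
  for (col in names(spec)) {
    attr(df[[col]], "format.sas") <- spec[[col]]$format
    attr(df[[col]], "width") <- spec[[col]]$width
  }
  df
}

df <- data.frame(
  char_var = c("Hello!", ""),
  num_var = c(1.5, 2.25),
  stringsAsFactors = FALSE
)

# Illustrative spec; the formats and widths are made-up values.
spec <- list(
  char_var = list(format = "$CHAR10", width = 10),
  num_var  = list(format = "8.2", width = 8)
)
df <- apply_xpt_spec(df, spec)

# Writing needs haven; version = 5 requests a SAS version 5 transport file.
if (requireNamespace("haven", quietly = TRUE)) {
  haven::write_xpt(df, tempfile(fileext = ".xpt"), version = 5)
}
```

Keeping the formats and lengths in one list makes it easier to review them against a data specification before writing.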
Hi @gorcha So good news - I was able to get lengths to all set appropriately, verified it in SAS, and was able to get the dataset through the compliance checks. But formats and informats aren't showing up. Here's my code:
library(dplyr)
library(haven)
ae <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_SDTM/ae.xpt"))
apply_lengths <- function(.data) {
  types <- purrr::map_chr(.data, class)
  lengths <- ifelse(types == "character", 200, 8)
  purrr::walk2(names(lengths), lengths, ~ {attr(.data[[.x]], "width") <<- .y})
  .data
}
ae <- ae %>%
  apply_lengths()
attr(ae$AESTDY, 'formats.sas') <- "8.2"
attr(ae$STUDYID, 'formats.sas') <- "$CHAR10"
haven::write_xpt(ae, './ae.xpt')
The package version is installed appropriately:
> packageVersion('haven')
[1] ‘2.4.3.9001’
And I can see the attributes, which I believe are set correctly:
> attr(ae$STUDYID, 'formats.sas')
[1] "$CHAR10"
> attr(ae$AESTDY, 'formats.sas')
[1] "8.2" |
Hi @mstackhouse, thanks for testing! Fortunately it's just a typo 😉 The attribute name is format.sas, not formats.sas:
attr(ae$AESTDY, 'format.sas') <- "8.2"
attr(ae$STUDYID, 'format.sas') <- "$CHAR10" |
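A misspelled attribute name like this is silently accepted by `attr<-`, so a quick check before writing can help. The sketch below is an assumption-laden illustration: `unexpected_attrs()` is a hypothetical helper, and the `known` list is only a guess at the attribute names a given workflow expects.

```r
# Hypothetical check (not part of haven): list, per column, any attribute
# names that are not in a set of expected names.
# Assumption: "label", "format.sas" and "width" are the names this workflow
# relies on; adjust `known` as needed.
unexpected_attrs <- function(df,
                             known = c("label", "format.sas", "width",
                                       "class", "levels", "names")) {
  found <- lapply(df, function(x) {
    setdiff(as.character(names(attributes(x))), known)
  })
  # Keep only the columns that carry at least one unexpected attribute.
  found[lengths(found) > 0]
}

ae_demo <- data.frame(AESTDY = c(1, 2), STUDYID = c("S1", "S1"),
                      stringsAsFactors = FALSE)
attr(ae_demo$AESTDY, "formats.sas") <- "8.2"     # misspelled on purpose
attr(ae_demo$STUDYID, "format.sas") <- "$CHAR10" # spelled correctly

# Reports "formats.sas" for AESTDY and nothing for STUDYID.
unexpected_attrs(ae_demo)
```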
Everything looks good!!!
Thank you so much @gorcha!!!! |
As a required part of the submission of clinical trial data to the FDA, data must be submitted using SAS version 5 transport files (XPT files). In addition to the file format, there are several compliance requirements for the format of those XPT files. A detailed account of these requirements can be found in the Study Data Technical Conformance Guide (https://www.fda.gov/media/88173/download) in section 3.3 (Page 14).
Similar to the support added in #562, as part of writing XPTs in R we have a need for customizing the formats, informats, and lengths written to XPT files. The length attribute in particular is of high priority, as users must be able to control the specified length of a variable, despite this not being applicable within R.
From my limited knowledge of C, ReadStat appears to have functionality for setting formats (haven/src/DfWriter.cpp, line 211 in 1b6db6b). Right now it looks like write_xpt() saves formats and informats as $ (#456), but the ability to change that with attr(df$my_var, "format") <- "DATE9" or something similar would close a gap in our use of R in our workflows.
Lengths for numeric variables should generally remain 8, but a custom length for character variables is necessary and critical for clinical submissions. This functionality is included in similar packages (r-gregmisc/SASxport#20), but given {haven}'s inclusion within the tidyverse, adopting this functionality would be of immense value for the pharmaceutical industry.
Unfortunately, I don’t have the C skills to parse through a lot of the src files, so I wouldn’t be able to assist in that part, but I could take on updating the man pages and R functionality if that would be of help.
To clarify, the enhancements are:
· Configurable length xpt metadata output via the length attribute for character variables.
· Configurable format xpt metadata output via the format attribute.
· Configurable informat xpt metadata output via the informat attribute.