Add support for custom formats, informats, and lengths #650

Closed
elimillera opened this issue Nov 17, 2021 · 21 comments
Labels
feature a feature request or enhancement

Comments

@elimillera

As a required part of submitting clinical trial data to the FDA, data must be submitted using SAS version 5 transport files (XPT files). In addition to the file format, there are several compliance requirements for the format of those XPT files. A detailed account of these requirements can be found in the Study Data Technical Conformance Guide (https://www.fda.gov/media/88173/download) in section 3.3 (Page 14).

Similar to the support added in #562, as part of writing XPT files in R we need to customize the formats, informats, and lengths written to those files. The length attribute in particular is a high priority, as users must be able to control the specified length of a variable, even though this concept does not apply within R.

From my limited knowledge of C, ReadStat appears to have functionality for setting formats (see `const char* var_format(cpp11::sexp x, VarType varType) {` and WizardMac/ReadStat#233) and newish functionality for informats (WizardMac/ReadStat#207).

Right now it looks like write_xpt() saves formats and informats as $ (#456), but the ability to change that with attr(df$my_var, "format") <- "DATE9" or something similar would close a gap in our use of R in our workflows.

Lengths for numeric variables should generally remain 8, but a custom length for character variables is necessary and critical for clinical submissions. This functionality is included in similar packages (r-gregmisc/SASxport#20), but given {haven}'s inclusion within the tidyverse, adopting this functionality would be of immense value for the pharmaceutical industry.

Unfortunately, I don’t have the C skills to parse through a lot of the src files, so I wouldn’t be able to assist with that part, but I could start the process of updating the man pages and R functionality if that would help.

To clarify, the requested enhancements are:
· Configurable length xpt metadata output via the length attribute for character variables.
· Configurable format xpt metadata output via the “format” attribute.
· Configurable informat xpt metadata output via the informat attribute.

@mstackhouse

Just want to state that adding this support would really be game changing within pharma. There's a misconception that XPT files must be created in SAS. While SASxport can generate compliant XPTs, its performance is an issue. Fixing these issues within {haven} would bring this capability into the tidyverse and make R more accessible for regulatory submission activities.

@gorcha gorcha added the feature a feature request or enhancement label Nov 18, 2021
@gorcha
Member

gorcha commented Nov 18, 2021

Hey @elimillera, thanks for the feature request!

I'll have a look, none of this sounds particularly complicated.

haven currently allows you to set formats using the "format.<vendor>" attribute (this allows you to set formats differently for different file types, although this won't set the informat attribute).

Have you tried using this for format setting?

attr(df$my_var, "format.sas") <- "DATE9"

@MichaelRimler

We are very much a {tidyverse} shop as a first stop for solutions - it would be tremendous if {haven} could close this gap for users creating compliant XPTs for regulatory submissions.

@nicholas-masel

Jumping on the bandwagon here to show support that these features would be very helpful and used by many in pharma as we transition to open-source ways of meeting FDA requirements.

@thomas-neitmann

thomas-neitmann commented Nov 18, 2021

Having this functionality in {haven} would boost adoption of R in Pharma even further, as excuses such as "R cannot create submission ready datasets" become invalid. Please implement this feature!

@rossfarrugia

+1 for the need for this from the pharma industry - thanks for requesting Eli!

@gorcha
Member

gorcha commented Nov 19, 2021

Hi all,

To confirm, as noted above there is a mechanism for setting formats in haven already, by setting the "format.<vendor>" attribute - format.stata, format.spss, or format.sas depending on the file type. This is set by haven when reading files in, and is used to set formats when writing files out.

See the example above:

attr(df$my_var, "format.sas") <- "DATE9"

When writing xpt files the format.sas attribute is used to define the variable format - ReadStat uses this variable format for both the format and informat values, so there's no mechanism available to set them independently as it stands.

Having said that, I've had a play and it looks like there's a bug in ReadStat, so formats don't always write out correctly. I'll have a closer look into it, but it doesn't seem too difficult to fix.

Currently the variable length is set to 8 for numeric variables and the maximum string length for character variables.
Again it shouldn't be too difficult to add in a variable width attribute that allows users to override the defaults.
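
The default length rule described here can be approximated in base R. This is only a sketch of the rule as stated in this comment (8 for numeric variables, the longest string for character variables), not haven's actual implementation. `default_xpt_length` is a hypothetical helper name, and measuring strings in bytes is an assumption.

```r
# Hypothetical helper: approximates the default length rule described above.
default_xpt_length <- function(x) {
  if (is.character(x)) {
    # Longest non-missing string, measured in bytes (an assumption).
    lens <- nchar(x[!is.na(x)], type = "bytes")
    if (length(lens) == 0) 0L else max(lens)
  } else {
    # Numeric variables default to a length of 8.
    8L
  }
}

df <- data.frame(
  id = c("A1", "B22", "C333"),
  value = c(1.5, 2, 3),
  stringsAsFactors = FALSE
)

# One default length per column.
vapply(df, default_xpt_length, integer(1))
```

A width attribute, as proposed above, would let users override these per-column defaults.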

@MichaelRimler

How might one write a null vector with write_xpt that does not result in a length of zero? Our experience trying to bring that data into a SAS environment via PROC COPY has been that, well, PROC COPY does not like zero length vectors. Our workaround was to drop the null vectors before writing, but that is not ideal and is.na is not a sufficient condition for identifying all vectors that result in a zero length variable in the resulting XPT.

@gorcha
Member

gorcha commented Nov 21, 2021

Hi @MichaelRimler, do you mean writing a data frame containing character vectors for which all records are blank?

Currently this would write a length of zero since that is the max string length. I think to make SAS happy we need a min length of 1 here.

To identify vectors that will have a length of zero in the meantime, you can just check the max string length. Something like this will do it:

mtcars$blank_string <- ""

lapply(mtcars, function(x) { max(nchar(x)) })
#> $mpg
#> [1] 4
#> 
#> $cyl
#> [1] 1
#> 
#> $disp
#> [1] 5
#> 
#> $hp
#> [1] 3
#> 
#> $drat
#> [1] 4
#> 
#> $wt
#> [1] 5
#> 
#> $qsec
#> [1] 5
#> 
#> $vs
#> [1] 1
#> 
#> $am
#> [1] 1
#> 
#> $gear
#> [1] 1
#> 
#> $carb
#> [1] 1
#> 
#> $blank_string
#> [1] 0

Created on 2021-11-21 by the reprex package (v2.0.1)
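
The same check can be narrowed to character columns and made to return only the affected names. This is a minimal sketch: `zero_width_cols` is a hypothetical helper, and treating all-`NA` character columns as blank is an assumption about how missing strings are written.

```r
# Hypothetical helper: names of character columns whose values are all "" or NA.
zero_width_cols <- function(df) {
  is_blank <- vapply(
    df,
    function(x) is.character(x) && all(is.na(x) | nchar(x) == 0),
    logical(1)
  )
  names(df)[is_blank]
}

df <- mtcars
df$blank_string <- ""
df$missing_string <- NA_character_

# Columns that would need handling before writing.
zero_width_cols(df)
```

Restricting the check to character columns avoids calling `nchar()` on numeric data, where it counts the digits of the printed value rather than a stored string length.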

@MichaelRimler

@gorcha Yes, length of 1 would be good. In one of our workflows, we need to go between R and SAS and XPT is the cleanest intermediary to move from data frame to sas7bdat. But, when a data frame has such a vector, though write_xpt has no challenge, PROC COPY does not like the transfer back. Our solution (workaround) is to drop the vectors before writing, but it is not ideal.

Eliminating the possibility that an XPT is created with length zero would resolve the problem as we’ve experienced it.

@gorcha
Member

gorcha commented Nov 21, 2021

Format writing fix PR in WizardMac/ReadStat#258

@mstackhouse

@gorcha whenever these feature updates are ready, we're happy to put together some practical testing to ensure that the Pharma needs are met. There's some compliance testing software that's used prior to submissions (called Pinnacle 21), which validates the XPT compliance along with our industry specific data standards. That combined with the handoff back and forth between R and SAS described by @MichaelRimler will verify everything we need.

@gorcha
Member

gorcha commented Nov 24, 2021

Thanks @mstackhouse!

There were a couple of changes required in ReadStat to fix up XPT format writing, and it'll take a little while for these changes to flow down stream to haven. I'll let you know once I've got something ready for testing.

@mstackhouse

@gorcha it looks like your changes in ReadStat were merged. I'm curious - are there other things that need to follow for this to flow downstream into haven?

@gorcha
Member

gorcha commented Dec 20, 2021

Hi @mstackhouse, this has been merged into the dev branch in ReadStat, but we generally wait for a ReadStat release before merging in to haven (unless it's a small/simple change) to make sure we don't diverge from the upstream code base. The releases are relatively frequent, but it depends on what else is going on in ReadStat land.

@mstackhouse

@gorcha Thanks! That's exactly what I was interested in - what milestones haven needs from ReadStat for those changes to flow downstream. Thanks for following up!

@gorcha
Member

gorcha commented Dec 21, 2021

No worries!

@gorcha
Member

gorcha commented Feb 24, 2022

Hi @mstackhouse and @elimillera, I've pulled in the latest release candidate from ReadStat and merged in the changes on the write_xpt branch if you'd like to test it out with your compliance testing software - you can install from the branch using remotes::install_github("tidyverse/haven@write_xpt").

Both formats and informats are set using the format.sas attribute, and variable lengths can be set using the width attribute.

For example:

library(haven)

df <- data.frame(
  char_var = "Hello!"
)

attr(df$char_var, "format.sas") <- "$CHAR10"
attr(df$char_var, "width") <- 10

I've also updated the minimum default string length to 1, so a blank character vector will have length 1 instead of 0 by default.
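
To apply one width to every character column, the attribute can be set in a loop. This is a sketch under stated assumptions: the data frame and the `target_width` value are made up for illustration, and the `width` attribute only takes effect with a haven build that includes the changes described in this comment.

```r
df <- data.frame(
  char_var = c("Hello!", "Hi"),
  blank_var = "",
  num_var = c(1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical target width, chosen only for illustration.
target_width <- 20

# Set the width attribute on character columns only; numerics keep their default.
for (col in names(df)) {
  if (is.character(df[[col]])) {
    attr(df[[col]], "width") <- target_width
  }
}

# Writing is guarded so the sketch still runs without haven installed.
if (requireNamespace("haven", quietly = TRUE)) {
  haven::write_xpt(df, tempfile(fileext = ".xpt"))
}
```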

@mstackhouse

Hi @gorcha

So good news - I was able to get all the lengths to set appropriately, verified them in SAS, and got the dataset through the compliance checks. But formats and informats aren't showing up. Here's my code:

library(dplyr)
library(haven)

ae <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_SDTM/ae.xpt"))

apply_lengths <- function(.data) {
  
  types <- purrr::map_chr(.data, class)
  lengths <- ifelse(types == "character", 200, 8)
  
  purrr::walk2(names(lengths), lengths, ~ {attr(.data[[.x]], "width") <<- .y})
  
  .data
}       

ae <- ae %>% 
  apply_lengths()

attr(ae$AESTDY, 'formats.sas') <- "8.2"
attr(ae$STUDYID, 'formats.sas') <- "$CHAR10"

haven::write_xpt(ae, './ae.xpt')

The package version is installed appropriately:

> packageVersion('haven')
[1] ‘2.4.3.9001’

And I can see the attributes, which I believe are set correctly:

> attr(ae$STUDYID, 'formats.sas')
[1] "$CHAR10"
> attr(ae$AESTDY, 'formats.sas')
[1] "8.2"

@gorcha
Member

gorcha commented Feb 24, 2022

Hi @mstackhouse, thanks for testing!

Fortunately it's just a typo 😉
These lines should use the attribute name format.sas instead of formats.sas:

attr(ae$AESTDY, 'formats.sas') <- "8.2"
attr(ae$STUDYID, 'formats.sas') <- "$CHAR10"
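
With the attribute name corrected, the two lines would look like the following. The small `ae` data frame here is a made-up stand-in so the snippet runs on its own; it is not the PHUSE test dataset used in the code above.

```r
# Minimal stand-in for the `ae` dataset (values are made up).
ae <- data.frame(STUDYID = "STUDY01", AESTDY = 3, stringsAsFactors = FALSE)

# Note the singular attribute name: "format.sas", not "formats.sas".
attr(ae$AESTDY, "format.sas") <- "8.2"
attr(ae$STUDYID, "format.sas") <- "$CHAR10"

# Confirm the attributes are stored under the expected name.
attr(ae$AESTDY, "format.sas")
attr(ae$STUDYID, "format.sas")
```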

@mstackhouse

Everything looks good!!!

  • I confirmed that the informats/formats can be written
  • Everything opens up perfectly fine in SAS Universal Viewer
  • I confirmed the default character vector writing to a length of 1
  • All the validation checks through Pinnacle 21 pass in reference to the integrity of the dataset (meaning basically, I see what I'd expect to see).

Thank you so much @gorcha!!!!

gorcha added a commit that referenced this issue Feb 28, 2022
* Update to dev readstat for xpt format fixes (#650). Maintains iconv hack from c1f9f19 and solaris hack from 4a878a1.
@gorcha gorcha closed this as completed in 455e206 Mar 1, 2022
7 participants