Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of Arrow within CohortDiagnostics causes issue #43

Open
azimov opened this issue Mar 14, 2023 · 5 comments
Open

Use of Arrow within CohortDiagnostics causes issue #43

azimov opened this issue Mar 14, 2023 · 5 comments

Comments

@azimov
Copy link

azimov commented Mar 14, 2023

Errors when using the develop branch of CohortDiagnostics occur. I'm not sure if this is Andromeda or the downstream use of FeatureExtraction.

Error (test-againstCdm.R:24:3): Cohort diagnostics in incremental mode
Error: Invalid: Failed to parse value: FEMALE
Backtrace:
  1. base::system.time(...)
       at test-againstCdm.R:24:2
  2. CohortDiagnostics::executeDiagnostics(...)
  5. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
       at CohortDiagnostics/R/RunDiagnostics.R:715:8
  8. CohortDiagnostics::runCohortTimeSeriesDiagnostics(...)
       at CohortDiagnostics/R/TimeSeries.R:575:12
 10. Andromeda (local) `$<-`(`*tmp*`, timeSeries, value = `<FlSystmD[,15]>`)
 12. Andromeda (local) `[[<-`(`*tmp*`, name, value = `<FlSystmD[,15]>`)
 13. Andromeda (local) .local(x, i, ..., value)
 14. arrow::write_dataset(value, file.path(attr(x, "path"), i), format = "feather")
 15. plan$Write(...)
 16. arrow:::ExecPlan_Write(...)

Error (test-moduleTimeSeries.R:86:5): Testing cohort time series execution
Error: Invalid: Failed to parse value: MALE
Backtrace:
  1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
       at test-moduleTimeSeries.R:86:4
  4. CohortDiagnostics::runCohortTimeSeriesDiagnostics(...)
       at CohortDiagnostics/R/TimeSeries.R:575:12
  6. Andromeda (local) `$<-`(`*tmp*`, timeSeries, value = `<FlSystmD[,15]>`)
  8. Andromeda (local) `[[<-`(`*tmp*`, name, value = `<FlSystmD[,15]>`)
  9. Andromeda (local) .local(x, i, ..., value)
 10. arrow::write_dataset(value, file.path(attr(x, "path"), i), format = "feather")
 11. plan$Write(...)
 12. arrow:::ExecPlan_Write(...)

Error (test-moduleTimeSeries.R:291:5): Testing Data source time series execution
Error: Invalid: Failed to parse value: MALE
Backtrace:
  1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
       at test-moduleTimeSeries.R:291:4
  4. CohortDiagnostics::runCohortTimeSeriesDiagnostics(...)
       at CohortDiagnostics/R/TimeSeries.R:639:8
  6. Andromeda (local) `$<-`(`*tmp*`, timeSeries, value = `<FlSystmD[,15]>`)
  8. Andromeda (local) `[[<-`(`*tmp*`, name, value = `<FlSystmD[,15]>`)
  9. Andromeda (local) .local(x, i, ..., value)
 10. arrow::write_dataset(value, file.path(attr(x, "path"), i), format = "feather")
 11. plan$Write(...)
 12. arrow:::ExecPlan_Write(...)

Steps to reproduce:

  • Install CohortDiagnostics (develop or main)
  • run unit tests (devtools::test())
  • See a number of failures.

Note that this occurs in the time series diagnostics as well as other places.

Running the time series diagnostics and breaking at line #441:

Lets you find an error of this type.

Here running resultsInAndromeda$allData %>% dplyr::collect() produces something similar.

Error in compute.Dataset(x) : Invalid: Failed to parse value: MALE

Note that gender in the SQL is taken directly from concept.concept_name a varchar/text string by definition.

Likely also reproducible when calling runCohortTimeSeriesDiagnostics directly.

@azimov
Copy link
Author

azimov commented Mar 14, 2023

Adding more to this issue:

This error also occurs.

Error (test-moduleCohortRelationship.R:363:5): Testing cohort relationship logic - incremental FALSE
Error: Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(timeId) of type double and FieldRef.Name(timeId) of type int32
Backtrace:
  1. CohortDiagnostics::runCohortRelationshipDiagnostics(...)
       at test-moduleCohortRelationship.R:363:4
  3. Andromeda (local) `$<-`(`*tmp*`, cohortRelationships, value = `<arrw_dp_[,38]>`)
  5. Andromeda (local) `[[<-`(`*tmp*`, name, value = `<arrw_dp_[,38]>`)
  6. Andromeda (local) .local(x, i, ..., value)
  7. arrow::write_dataset(value, tempDir, format = "feather")
  8. plan$Build(dataset)
  9. self$Build(.data$.data)
 10. node$Join(...)
 12. arrow:::ExecNode_Join(...)

This is likely with a joing using dplyr here.

esultsInAndromeda$cohortRelationships <-
      resultsInAndromeda$cohortRelationships %>%
      dplyr::inner_join(resultsInAndromeda$timePeriods, by = "timeId") %>%
      dplyr::select(-timeId) %>%
      dplyr::arrange(
        cohortId,
        comparatorCohortId,
        startDay,
        endDay
      )

The table cohortRelationships has a time id field that is literally,

SELECT ..., @time_id time_id ...

Looping over values from timePeriods which is assigned via the user of row_number in R.

casting time_id as int in the sql may fix this, however, I expect that being able to dynamically join to compatible types would be a desired feature as dbdplyr doesn't expect explicit type casting.

@ablack3
Copy link
Collaborator

ablack3 commented Mar 17, 2023

Thanks for reporting this!

@ablack3
Copy link
Collaborator

ablack3 commented Mar 20, 2023

@anthonysena Have you tested FeatureExtraction with the develop branch of Andromeda?

@anthonysena
Copy link

Not yet @ablack3 - trying to get the unit tests back online and then see how FE is working with the develop branch of Andromeda

@azimov
Copy link
Author

azimov commented Mar 29, 2023

@ablack3 so following up on the CohortDiagnostics Results i was able to resolve the initial issues by casting explicitly to character or integer in SQL when querying to Andromeda, forcing the arrow object to store empty values as the correct type.
Though this works I'd still note that the joins between different data types is something that will happen and could shoot us in the foot - it could be that non of our unit tests produce this situation so a bug is still lying around that will go unnoticed (e.g. a left join to an empty tibble could be enough to break things). If there was a way to resolve this internally I think it would be worthwhile.

However, when running with the latest version I get a new strange error. The problem is that there are multiple rows keys in a results object that should be generated from a group by select. (This doesn't happen in the current released version of Andromeda) and I can't figure out why.

Warning (test-5-moduleTimeSeries.R:86:5): Testing cohort time series execution
Default behavior of `pull()` on Arrow data is changing. Current behavior of returning an R vector is deprecated, and in a future release, it will return an Arrow `ChunkedArray`. To control this:
i Specify `as_vector = TRUE` (the current default) or `FALSE` (what it will change to) in `pull()`
i Or, set `options(arrow.pull_as_vector)` globally
This warning is displayed once every 8 hours.
Backtrace:
 1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
      at test-5-moduleTimeSeries.R:86:4
 7. arrow:::pull.arrow_dplyr_query(., n)
 8. arrow:::handle_pull_as_vector(out, as_vector)

Error (test-5-moduleTimeSeries.R:86:5): Testing cohort time series execution
Error in `makeDataExportable(x = data, tableName = "time_series", minCellCount = minCellCount, 
    databaseId = databaseId)`:  - duplicates found in primary key for table time_series. The primary keys are: cohortId, databaseId, periodBegin, periodEnd, seriesType, calendarInterval, gender, ageGroup
Backtrace:
 1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
      at test-5-moduleTimeSeries.R:86:4
 2. CohortDiagnostics:::makeDataExportable(...)
      at CohortDiagnostics/R/TimeSeries.R:590:8

Error (test-5-moduleTimeSeries.R:291:5): Testing Data source time series execution
Error in `makeDataExportable(x = data, tableName = "time_series", minCellCount = minCellCount, 
    databaseId = databaseId)`:  - duplicates found in primary key for table time_series. The primary keys are: cohortId, databaseId, periodBegin, periodEnd, seriesType, calendarInterval, gender, ageGroup
Backtrace:
 1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
      at test-5-moduleTimeSeries.R:291:4
 2. CohortDiagnostics:::makeDataExportable(...)
      at CohortDiagnostics/R/TimeSeries.R:652:4

The above is a check that is performed when exporting data from the package - this is to ensure that the results data can be inserted into the result schemas for viewing in the shiny app etc.

To reproduce:

  1. Clone CohortDiagnostics git
  2. checkout develop branch
  3. Working - run unit tests with Sqlite Andromeda (version 0.6.3)
  4. Not working (produces above errors) - install Andromeda develop branch, run unit tests

The offending code is around here. This wasn't written by me (and I find it kind of confusingly written so this might be a tough one to work out) - it may be just an error with the combinations of Andromeda objects. However, the SQL that generates the results looks fine - it's using a GROUP BY clause so the rows in question should be aggregated (and indeed, are in the Sqlite version).

Either CohortDiagnostics is doing something bad when combining the objects like this or something is going wrong when exporting the data. I can't quite figure out why that might be, and why it works fine when using the Sqlite version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants