TADAOutliers #47

Open
cristinamullin opened this issue Apr 15, 2022 · 4 comments

@cristinamullin
Collaborator

cristinamullin commented Apr 15, 2022

Consider adding outlier information to TADA stats function.

Append one or two additional columns to the dataset flagging outliers at the individual station/char level and/or at the all stations/char level.

Add new function input for stats to flag outliers across single station (input ID) or all stations:
Scale = AllStations
Scale = IndividualStations
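
A minimal sketch of how the two proposed `Scale` options could differ, written in plain Python rather than as TADA code. It assumes the 1.5 × IQR rule discussed later in this thread; the function name `flag_by_scale` and the record keys `station`, `characteristic`, and `value` are hypothetical and are not taken from the TADA package.

```python
from collections import defaultdict
from statistics import quantiles


def flag_by_scale(records, scale="AllStations", k=1.5):
    """Return one True/False outlier flag per record.

    `records` is a list of dicts with hypothetical keys
    'station', 'characteristic', and 'value'.
    """
    if scale not in ("AllStations", "IndividualStations"):
        raise ValueError("scale must be 'AllStations' or 'IndividualStations'")

    def group_key(rec):
        # IndividualStations: fences per station/characteristic pair.
        # AllStations: fences per characteristic, pooled over stations.
        if scale == "IndividualStations":
            return (rec["characteristic"], rec["station"])
        return (rec["characteristic"],)

    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec)].append(rec["value"])

    fences = {}
    for key, values in groups.items():
        if len(values) < 2:
            # Quartiles are undefined for a single value; flag nothing.
            fences[key] = None
            continue
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        fences[key] = (q1 - k * iqr, q3 + k * iqr)

    flags = []
    for rec in records:
        fence = fences[group_key(rec)]
        if fence is None:
            flags.append(False)
        else:
            lower, upper = fence
            flags.append(rec["value"] < lower or rec["value"] > upper)
    return flags
```

The same result can be unusual for its own station but unremarkable in the pooled data (or the reverse), which is why the two scales might be worth reporting in separate columns.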

@cristinamullin
Collaborator Author

cristinamullin commented Apr 14, 2023

We need to be cautious about removal of outliers in environmental datasets.

This would only provide an option to review and remove data that differ from approximately 99% of the data available for a given parameter and unit combination. The goal is only to catch invalid data; many outliers are still valid results.

The tool would provide an option to flag data that fall above the upper value or below the lower value defined here:
Upper Outlier = 75th percentile + 1.5 * (75th percentile - 25th percentile)
Lower Outlier = 25th percentile - 1.5 * (75th percentile - 25th percentile)
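
As a rough illustration of the two thresholds above, here is a small Python sketch using only the standard library. The function names `tukey_fences` and `flag_outliers` are hypothetical, and the quartile method (linear interpolation, `method="inclusive"`) is an assumption; other percentile definitions give slightly different thresholds.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # 75th percentile - 25th percentile
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Label each value relative to the thresholds; nothing is removed."""
    lower, upper = tukey_fences(values, k)
    labels = []
    for v in values:
        if v > upper:
            labels.append("above_upper")
        elif v < lower:
            labels.append("below_lower")
        else:
            labels.append("not_outlier")
    return labels


results = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_fences(results))   # (-3.5, 14.5)
print(flag_outliers(results))  # only the value 100 is labelled "above_upper"
```

Returning labels instead of dropping rows keeps the decision to remove a result with the reviewer, in line with the caution above.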

Jim Hagy (see TADA Working Group notes: https://usepa.sharepoint.com/:w:/r/sites/AutomatedDataAnalysisWorkingGroup/_layouts/15/Doc.aspx?sourcedoc=%7BC74D9A1C-DCEE-46B1-AC07-E05AD63E2714%7D&file=IssuePaper_RetrievalQAQC_Jan2021.docx&action=default&mobileredirect=true): It would be useful to be able to select whether this flagging process is applied to the original data or the log of the data. For data that are strongly log-normally distributed, many valid observations will be >1.5*IQR above the 75th percentile. But if you applied those percentiles to the logs, it would be a different story.

This is one place where the distribution charts become helpful. We could apply the outlier test to the original data or the log of the data, depending on the data distribution. See examples in CDC app: https://ergapps.shinyapps.io/atsdrepc/
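
A hedged sketch of what the raw-versus-log choice could look like, again in plain Python. The `log_transform` argument and the function name are hypothetical, base 10 is an arbitrary choice, and rejecting non-positive values is one assumption about how to handle zeros; none of this reflects an existing TADA interface.

```python
import math
from statistics import quantiles


def flag_outliers(values, k=1.5, log_transform=False):
    """Return True for each value outside Q1 - k*IQR .. Q3 + k*IQR.

    With log_transform=True the quartiles and thresholds are computed
    on log10(values) instead of on the original scale.
    """
    if log_transform:
        if any(v <= 0 for v in values):
            raise ValueError("log transform requires strictly positive values")
        work = [math.log10(v) for v in values]
    else:
        work = list(values)
    q1, _, q3 = quantiles(work, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [w < lower or w > upper for w in work]


# Values that double at each step, a simple stand-in for skewed data.
results = [1, 2, 4, 8, 16, 32, 64, 128, 256]
print(flag_outliers(results))                      # 256 is flagged on the original scale
print(flag_outliers(results, log_transform=True))  # nothing is flagged on the log scale
```

On the original scale the largest value sits above the upper threshold, while on the log scale the same value is well inside it, which matches the concern raised about log-normally distributed data.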

@cristinamullin
Collaborator Author

cristinamullin commented Apr 14, 2023

This topic could potentially be related to the censored data method used for each characteristic (but feel free to move this to a new issue):

Example.....

Cristina: Is 1/x useful?
Lesley Merrick (OR): They use it when the detection limit (or ½ detection limit) is above the water quality standard, particularly when using geomean. This is our white paper on using censored data in the IR: https://www.oregon.gov/deq/FilterDocs/iriCensoredData.pdf

@cristinamullin
Collaborator Author

This issue is related to the TADA Shiny issue and pending development of an outlier tab: USEPA/TADAShiny#137

@hillarymarler
Collaborator

A few existing packages related to outliers:

  1. envoutliers: Methods for Identification of Outliers in Environmental Data - https://cran.r-project.org/web/packages/envoutliers/index.html
  2. EnvStats: Package for Environmental Statistics, Including US EPA Guidance - https://cran.r-project.org/web/packages/EnvStats/index.html (some outlier functions)
  3. outliers: A collection of some tests commonly used for identifying outliers - https://cran.r-project.org/web/packages/outliers/index.html

@cristinamullin are there any notes from previous working group discussions that might be helpful for me to review on this topic?

@wokenny13 the EnvStats package might be useful to check out for some of the mod 3 functions.
