TADAOutliers #47
Comments
We need to be cautious about removing outliers from environmental datasets. This feature would only provide an option to review and remove results that differ from approximately 99% of the data available for a given parameter and unit combination. The goal is only to catch invalid data; many outliers are still valid results. The tool would provide an option to flag data that fall above or below these values:
Jim Hagy (see TADA Working Group notes: https://usepa.sharepoint.com/:w:/r/sites/AutomatedDataAnalysisWorkingGroup/_layouts/15/Doc.aspx?sourcedoc=%7BC74D9A1C-DCEE-46B1-AC07-E05AD63E2714%7D&file=IssuePaper_RetrievalQAQC_Jan2021.docx&action=default&mobileredirect=true): It would be useful to be able to select whether this flagging process is applied to the original data or to the log of the data. For data that are strongly log-normally distributed, many valid observations will be more than 1.5*IQR above the 75th percentile. If the same rule were applied to the log-transformed values, far fewer of those observations would typically be flagged. This is one place where the distribution charts become helpful. We could apply the outlier test to the original data or to the log of the data, depending on the data distribution. See examples in the CDC app: https://ergapps.shinyapps.io/atsdrepc/
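As a rough illustration of the raw-versus-log choice discussed above, here is a minimal Python sketch of the 1.5*IQR rule applied on either scale. TADA itself is an R package, so this is not its implementation: the function name `iqr_flags`, the `use_log` switch, the quartile method, and the sample values are all hypothetical.

```python
import math
import statistics


def iqr_flags(values, k=1.5, use_log=False):
    """Return one boolean per value: True when it lies outside the IQR fences.

    With use_log=True the fences are computed on log10(value), which
    requires every value to be strictly positive.
    """
    data = [math.log10(v) for v in values] if use_log else list(values)
    # Quartiles by linear interpolation over the observed range (an assumption;
    # other quartile definitions shift the fences slightly).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    spread = k * (q3 - q1)
    lower, upper = q1 - spread, q3 + spread
    return [x < lower or x > upper for x in data]


# Made-up, right-skewed results for one parameter and unit combination.
results = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 10.0, 40.0]

raw_flags = iqr_flags(results)                # fences on the original values
log_flags = iqr_flags(results, use_log=True)  # fences on log10 of the values
```

In this made-up sample the raw-scale rule flags both 10.0 and 40.0, while the log-scale rule flags only 40.0, which is the kind of difference a distribution chart would help a reviewer anticipate.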
This topic could be related to the censored data method used for each characteristic (feel free to move this to a new issue): Example..... Cristina, is 1/x useful?
This issue is related to the TADA Shiny issue and pending development of an outlier tab: USEPA/TADAShiny#137
A few existing packages related to outliers:
@cristinamullin are there any notes from previous working group discussions that might be helpful for me to review on this topic? @wokenny13 the EnvStats package might be useful to check out for some of the mod 3 functions.
Consider adding outlier information to the TADA stats function.
Append one or two additional columns to the dataset that flag outliers at the individual station/characteristic level and/or at the all-stations/characteristic level.
Add a new input to the stats function to choose whether outliers are flagged within a single station (by input ID) or across all stations:
Scale = AllStations
Scale = IndividualStations
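To make the two proposed `Scale` options concrete, here is a minimal Python sketch of how the grouping could differ. It is only an assumption about the eventual design: the function `flag_outliers`, the field names (`station`, `characteristic`, `unit`, `value`), the `outlier` column, and the 1.5*IQR rule are hypothetical and do not come from the TADA package.

```python
import statistics
from collections import defaultdict

# Hypothetical mapping from the proposed Scale values to grouping fields.
GROUP_FIELDS = {
    "AllStations": ("characteristic", "unit"),
    "IndividualStations": ("station", "characteristic", "unit"),
}


def flag_outliers(rows, scale="AllStations", k=1.5):
    """Return a copy of each row with a boolean 'outlier' entry appended.

    Fences are Q1 - k*IQR and Q3 + k*IQR, computed within each group.
    Groups with fewer than four values are left unflagged.
    """
    if scale not in GROUP_FIELDS:
        raise ValueError(f"unknown scale: {scale!r}")
    fields = GROUP_FIELDS[scale]

    def group_key(row):
        return tuple(row[f] for f in fields)

    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row["value"])

    fences = {}
    for key, vals in groups.items():
        if len(vals) < 4:
            continue  # too few results to estimate quartiles sensibly
        q1, _, q3 = statistics.quantiles(vals, n=4, method="inclusive")
        spread = k * (q3 - q1)
        fences[key] = (q1 - spread, q3 + spread)

    no_fence = (float("-inf"), float("inf"))
    flagged = []
    for row in rows:
        low, high = fences.get(group_key(row), no_fence)
        flagged.append({**row, "outlier": not (low <= row["value"] <= high)})
    return flagged


# Made-up results: station A runs low, station B runs high.
rows = [
    {"station": s, "characteristic": "Nitrate", "unit": "mg/L", "value": v}
    for s, vals in {"A": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "B": [20.0, 21.0, 22.0, 23.0, 30.0]}.items()
    for v in vals
]

all_scale = flag_outliers(rows, scale="AllStations")
by_station = flag_outliers(rows, scale="IndividualStations")
```

With these made-up rows, the value 30.0 is flagged only under `IndividualStations`: it is unusual for station B but falls inside the much wider fences obtained when both stations are pooled.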