Conversation
…ocations; add helper function
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #178 +/- ##
==========================================
+ Coverage 14.98% 16.05% +1.07%
==========================================
Files 14 15 +1
Lines 1335 1389 +54
==========================================
+ Hits 200 223 +23
- Misses 1135 1166 +31 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Please fix conflicts @O957 |
…ion-inclusionexclusion-logic-across-codebase
Co-authored-by: Dylan H. Morris <dylanhmorris@users.noreply.github.com>
R/location_exclusions.R
Outdated
| apply_location_exclusions <- function( | ||
| data, | ||
| excluded_locations, | ||
| supported_targets = NULL | ||
| ) { |
There was a problem hiding this comment.
@sbidari lmk if you disagree, but I don't think we should try to have one function that does target and non-target specific exclusions for target-specific datasets and non-target-specific exclusions for non-target-specific datasets.
Where specifically is the second behavior needed/used? That will help us think about what the behavior should be and how it should be implemented. e.g. there's some argument that check_hospital_reporting_latency should accept things of the form
list("all" = "AA", "wk inc covid hosp" = "BB", "wk inc covid prop ed visits" = c("CC", "DD"))and under the hood exclude AA, CC, and DD, since even though there's no target column in the dataset being filtered, it "knows" that this is about the admissions target.
There was a problem hiding this comment.
I am not sure I completely understand the suggestion/comment.
Are you saying apply_location_exclusions should either only be used for multitarget datasets or should be able to selectively exclude intended locations from a single-target dataset w/o target column?
There was a problem hiding this comment.
As written, apply_location_exclusions has meaningfully different behavior depending on whether supported_targets is NULL:
-
If
NULL:- filters only on the
locationcolumn. Thetargetcolumn need not be presented and is ignored if it is present. - Uses only the exclusions specified in
"all". Silently ignores any other exclusions.
- filters only on the
-
If not
NULL:- Filters jointly on the
targetandlocationcolumns (via anti-join). Both columns must be present. - Applies exclusions specified in
"all"and under target-specific names, if any. Errors if there are any names in the specified exclusions that are neither "all" nor the name of a valid target.
- Filters jointly on the
I think we want to avoid this. I'm concerned that the change in function behavior is a bit opaque and could lead to user error.
In order to decide what the best alternative is, I think we need to understand more from @O957 about the intended use for each of the two implemented behaviors.
There was a problem hiding this comment.
I pursued two separate functions:
apply_target_location_exclusions(data, excluded_locations, supported_targets): only filtering function; requiressupported_targets; always filters on both target and location via anti-join; plain character vectors are normalized, default excludes from all targets.flatten_excluded_locations(excluded_locations): data reshaping utility; collapses any exclusion (character vector or named list) into a flat character vector of unique abbreviations; no filtering, no data frame involved.
Old apply_location_exclusions() was removed, i.e. no longer a function that filters only on location while ignoring target.
For check_hospital_reporting_latency: doesn't filter a data frame, rather checks whether hospitals reported on time; needs a flat list of locations to skip, nothing w/ targets, so compute_target_webtext_values calls flatten_excluded_locations(excluded_locations) before passing to it, e.g. list("all" = "AA", "wk inc covid hosp" = "BB", "wk inc covid prop ed visits" = c("CC", "DD")) flattens to c("AA", "BB", "CC", "DD")
General pattern: rather skip check than falsely flag; check_hospital_reporting_latency is specifically about the admissions target , so target filtering inside it seems not warrented.
There was a problem hiding this comment.
In writing this, I noticed that e.g.
list(
"all" = "VI",
"wk inc covid prop ed visits" = "GU"
)
produces c("VI", "GU") from flatten_excluded_locations. If passed to check_hospital_reporting_latency, GU gets excluded even though it was only meant to be excluded from ed visits.
There was a problem hiding this comment.
I think removing excluded_locations argument from check_hospital_reporting_latency is reasonable here. This way, we always check latency for expected_locations = forecasttools::us_location_table$code. If there is a data issue with one of the locations leading it to be excluded from hub reports, having it included in hospital reporting flag is fine (desirable)?
|
In the current implementation, if a user wants to exclude a location say "MA", they need to enter Can we instead add the |
|
I can do this (seems more sensible, now on second thought).
Thank you both so far for your considerations. I will wait to see if DHM has any further realizations regarding expected behavior considerations. |
|
I agree with @sbidari that the standard interface we want is "default exclusions + 0 or more one-off exclusions". We possibly want the additional option to turn off the default exclusions, but that should be a separate step. A (new) idea for how to handle this. What if we remove default exclusions from hubhelpr entirely? Instead, handle both default and additional one-off user-requested exclusions in the individual downstream repositories that use hubhelpr and its actions. This has a couple potential advantages:
|
This sounds good to me. |
…s-codebase' of O957:CDCgov/hubhelpr into 174-standardize-location-inclusionexclusion-logic-across-codebase
…ion-inclusionexclusion-logic-across-codebase
|
Accidentally went to this PR for re-review; not ready for re-review yet, apologies. |
|
PRs in Also: I think I go everything with Cmd-Shift-F but let me know if you see any missed exclusions changes that need to be made. |
|
Needed one more changes in tests. Now |
…ion-inclusionexclusion-logic-across-codebase
|
Previous Situation:
Changes:
Workflows: For hub repositories (covid19-forecast-hub, rsv-forecast-hub, cfa-forecast-hub-reports): exclusions specified as JSON in workflow YAML (e.g. In one case, Then, all functions accept This is used by |
|
Why: Now:
|
| purrr::walk(excluded_locations, function(x) { | ||
| checkmate::assert_character( | ||
| x, | ||
| .var.name = "excluded_locations list values" | ||
| ) | ||
| }) |
There was a problem hiding this comment.
This just checks if list values are characters. If you want to validate input, you should check values are valid character inputs (forecasttools::us_location_table$abbr)
R/update_hub_target_data.R
Outdated
| supported_targets <- get_hub_supported_targets(base_hub_path) | ||
| new_data <- apply_target_location_exclusions( | ||
| new_data, | ||
| excluded_locations, | ||
| supported_targets | ||
| ) | ||
| } |
There was a problem hiding this comment.
I feel somewhat strongly that filters for target data should be based on inclusions (i.e. what locations we want to pull from the ones available upstream). One of the reasoning is that exclusion-based filtering is not robust to changes in upstream data (NHSN and NSSP). For instance, HSA regions were added to NHSN data some time ago which broke our pipeline as we were only excluding US territories.
Something like this is needed here:
included_locations <- setdiff(
forecasttools::us_location_table$code,
excluded_locations
)
|
Hi @O957 can you explain in the comments what approach you took (#178 (comment) and #178 (comment))? That will make it easier to review |
…ion-inclusionexclusion-logic-across-codebase
…ion-inclusionexclusion-logic-across-codebase
|
Re: inclusion-based filtering for upstream data... I made it so that functions pull from Runs per-target (for target-specific exclusions) and keeps only rows with locations in the included set. Hub data ( Re: Kept |
| return(data) | ||
| } | ||
|
|
||
| data_targets <- unique(data$target) |
There was a problem hiding this comment.
why not hub supported targets here? as before?
| exclusion_df <- purrr::map_df(data_targets, \(tgt) { | ||
| excl_abbrs <- get_target_exclusions(normalized, tgt) | ||
| if (length(excl_abbrs) == 0) { | ||
| return(tibble::tibble(target = character(), location = character())) | ||
| } | ||
| tibble::tibble( | ||
| target = tgt, | ||
| location = forecasttools::us_location_recode(excl_abbrs, "abbr", "hub") | ||
| ) | ||
| }) |
There was a problem hiding this comment.
This will implicitly handle empty rows
| exclusion_df <- purrr::map_df(data_targets, \(tgt) { | |
| excl_abbrs <- get_target_exclusions(normalized, tgt) | |
| if (length(excl_abbrs) == 0) { | |
| return(tibble::tibble(target = character(), location = character())) | |
| } | |
| tibble::tibble( | |
| target = tgt, | |
| location = forecasttools::us_location_recode(excl_abbrs, "abbr", "hub") | |
| ) | |
| }) | |
| exclusion_df <- exclusion_df <- dplyr::tibble(target = data_targets) |> | |
| dplyr::mutate( | |
| location = purrr::map( | |
| target, | |
| \(tgt) forecasttools::us_location_recode( | |
| get_target_exclusions(normalized, tgt), | |
| "abbr", | |
| "hub" | |
| ) | |
| ) | |
| ) |> | |
| tidyr::unnest_longer(location) |
| if (use_hub_data) { | ||
| target_data <- apply_target_location_exclusions( | ||
| target_data, | ||
| excluded_locations | ||
| ) | ||
| } else { | ||
| target_data <- filter_to_included_locations( | ||
| target_data, | ||
| excluded_locations | ||
| ) | ||
| } |
There was a problem hiding this comment.
Move this into existing if-else structure above
| if (!is.null(excluded_locations) && length(excluded_locations) > 0) { | ||
| excluded_codes <- forecasttools::us_location_recode( | ||
| excluded_locations, | ||
| "abbr", | ||
| "hub" | ||
| ) | ||
| } else { | ||
| excluded_codes <- character(0) | ||
| } | ||
| expected_locations <- setdiff( | ||
| forecasttools::us_location_table$code, | ||
| excluded_codes | ||
| ) | ||
|
|
There was a problem hiding this comment.
This doesn't work when excluded_locations is a named list. Why is excluded_locations expected to be a character vector here?
| filter_to_included_locations <- function( | ||
| data, | ||
| excluded_locations | ||
| ) { | ||
| normalized <- normalize_excluded_locations(excluded_locations) | ||
| all_valid_codes <- forecasttools::us_location_table$code | ||
|
|
||
| purrr::map_df(unique(data$target), \(tgt) { | ||
| if (!is.null(normalized)) { | ||
| excl_abbrs <- get_target_exclusions(normalized, tgt) | ||
| excl_codes <- forecasttools::us_location_recode( | ||
| excl_abbrs, | ||
| "abbr", | ||
| "hub" | ||
| ) | ||
| included_codes <- setdiff(all_valid_codes, excl_codes) | ||
| } else { | ||
| included_codes <- all_valid_codes | ||
| } | ||
| dplyr::filter(data, .data$target == tgt, .data$location %in% included_codes) | ||
| }) | ||
| } |
There was a problem hiding this comment.
There is a mismatch here between function name and argument. Name of the function makes it easy to misinterpret the argument.
I think rename the function name to filter_to_expected_location that takes in expected locations (default: forecasttools::us_location_table$code) and excluded locations (default: NULL).
Then create expected_df and exclusion_df
expected_df <- tidyr::crossing(
target = get_hub_supported_targets(),
location = forecasttools::us_location_table$code
)
exclusion_df <- exclusion_df <- dplyr::tibble(target = data_targets) |>
dplyr::mutate(
location = purrr::map(
target,
\(tgt) forecasttools::us_location_recode(
get_target_exclusions(normalized, tgt),
"abbr",
"hub"
)
)
) |>
tidyr::unnest_longer(location)
expected_target_location_df <- dplyr::anti_join(
expected_df, exclusion_df
)
filtered <- dplyr::inner_join(
data,
expected_target_location_df,
by = c("target", "location")
)
There was a problem hiding this comment.
I don't love this, but I think it's clearer than the current approach and has same approach as apply_exclusion (https://github.com/CDCgov/hubhelpr/pull/178/files#r2990093161)
Open to other ideas if you have any?
|
I have responses to some of the questions but will wait to explain in detail until I see the lay of the land once #196 is merged. |
…ion-inclusionexclusion-logic-across-codebase
…ion-inclusionexclusion-logic-across-codebase
In short this PR removes
included_locationsthroughout the codebase (with updates to rely onexcluded_locations) and adds a helper functionapply_location_exclusionsfor deduplication of location-exclusioning. The spirit of this PR is to rely less on the fragile state of having bothincluded_locationsandexcluded_locations.More specifically, this PR:
included_locationsfrom the codebase, including inconstants.R.excluded_locationsreferences todefault_excluded_locationsand switched from FIPS codes ("78", "74", "69", "66", "60") to state abbreviations ("VI", "GU", "AS", "MP", "UM") (usingforecasttoolsofc).location_exclusions.Rfor functions inutils.Rpertaining to location exclusions and for the new helper functionapply_location_exclusions(), which centralizes exclusion filtering (it handles three input types (NULL, character vector, named list for target-specific exclusions) and replaces the inlinedplyr::filter(.data$location %in% !!included_locations)calls).excluded_locations = hubhelpr::default_excluded_locationsinstead ofincluded_locationsorexcluded_locations = NULL.generate-viz-data,update-target-data) to accept anexcluded_locationsJSON.