Standardize Locations Exclusion-Inclusion Logic#178

Open
O957 wants to merge 18 commits into main from 174-standardize-location-inclusionexclusion-logic-across-codebase

Conversation

@O957
Collaborator

@O957 O957 commented Mar 16, 2026

In short, this PR removes included_locations throughout the codebase (with updates to rely on excluded_locations) and adds a helper function, apply_location_exclusions, to deduplicate location-exclusion logic. The spirit of this PR is to rely less on the fragile state of having both included_locations and excluded_locations.

More specifically, this PR:

  • Removes any reference to included_locations from the codebase, including in constants.R.
  • Renames some excluded_locations references to default_excluded_locations and switches from FIPS codes ("78", "74", "69", "66", "60") to state abbreviations ("VI", "GU", "AS", "MP", "UM") (using forecasttools, of course).
  • Adds a file location_exclusions.R for functions in utils.R pertaining to location exclusions and for the new helper function apply_location_exclusions(), which centralizes exclusion filtering (it handles three input types (NULL, character vector, named list for target-specific exclusions) and replaces the inline dplyr::filter(.data$location %in% !!included_locations) calls).
  • Updates functions to take excluded_locations = hubhelpr::default_excluded_locations instead of included_locations or excluded_locations = NULL.
  • Updates the two GitHub Actions (generate-viz-data, update-target-data) to accept an excluded_locations JSON.
  • Updates tests with new parameter names and defaults.
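
As a rough illustration of the FIPS-to-abbreviation switch in the second bullet, the mapping can be written as a plain lookup. This sketch uses a hard-coded named vector instead of forecasttools, and `fips_to_abbr` is a hypothetical name that does not exist in hubhelpr.

```r
# Hypothetical lookup table (not part of hubhelpr): FIPS codes for the
# five territories named above, mapped to their postal abbreviations.
fips_to_abbr <- c(
  "60" = "AS",
  "66" = "GU",
  "69" = "MP",
  "74" = "UM",
  "78" = "VI"
)

# The old FIPS-based exclusions, in the order given in the PR description.
old_excluded <- c("78", "74", "69", "66", "60")

# Recode by name lookup; unname() drops the FIPS codes from the result.
new_excluded <- unname(fips_to_abbr[old_excluded])
print(new_excluded)
```

The recoded vector has the same members as the abbreviation list in the bullet, although in a different order, which does not matter for set-style exclusion.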

@O957 O957 linked an issue Mar 16, 2026 that may be closed by this pull request
@O957 O957 self-assigned this Mar 16, 2026
@codecov-commenter

codecov-commenter commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 26.47059% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 16.05%. Comparing base (c50e9dc) to head (d9a6ac1).

Files with missing lines            Patch %   Lines
R/location_exclusions.R             35.38%    42 Missing ⚠️
R/write_webtext.R                   0.00%     20 Missing ⚠️
R/write_viz_target_data.R           0.00%     9 Missing ⚠️
R/summarize_ref_date_forecasts.R    0.00%     4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #178      +/-   ##
==========================================
+ Coverage   14.98%   16.05%   +1.07%     
==========================================
  Files          14       15       +1     
  Lines        1335     1389      +54     
==========================================
+ Hits          200      223      +23     
- Misses       1135     1166      +31     

@dylanhmorris
Contributor

Please fix conflicts @O957

O957 and others added 2 commits March 16, 2026 14:48
…ion-inclusionexclusion-logic-across-codebase
Co-authored-by: Dylan H. Morris <dylanhmorris@users.noreply.github.com>
@O957 O957 requested a review from dylanhmorris March 16, 2026 18:49
Comment on lines +95 to +99
apply_location_exclusions <- function(
  data,
  excluded_locations,
  supported_targets = NULL
) {
Contributor

@sbidari lmk if you disagree, but I don't think one function should try to handle both cases: target-specific and non-target-specific exclusions for datasets that have a target column, and non-target-specific exclusions for datasets that don't.

Where specifically is the second behavior needed/used? That will help us think about what the behavior should be and how it should be implemented. e.g. there's some argument that check_hospital_reporting_latency should accept things of the form

list("all" = "AA", "wk inc covid hosp" = "BB", "wk inc covid prop ed visits" = c("CC", "DD"))

and under the hood exclude AA, CC, and DD, since even though there's no target column in the dataset being filtered, it "knows" that this is about the admissions target.

Collaborator

I am not sure I completely understand the suggestion/comment.
Are you saying apply_location_exclusions should either only be used for multitarget datasets or should be able to selectively exclude intended locations from a single-target dataset w/o target column?

Contributor

As written, apply_location_exclusions has meaningfully different behavior depending on whether supported_targets is NULL:

  • If NULL:

    • Filters only on the location column. The target column need not be present and is ignored if it is present.
    • Uses only the exclusions specified in "all". Silently ignores any other exclusions.
  • If not NULL:

    • Filters jointly on the target and location columns (via anti-join). Both columns must be present.
    • Applies exclusions specified in "all" and under target-specific names, if any. Errors if there are any names in the specified exclusions that are neither "all" nor the name of a valid target.

I think we want to avoid this. I'm concerned that the change in function behavior is a bit opaque and could lead to user error.
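
To make the concern concrete, here is a minimal base-R sketch of the two branches described above. It is written from the description in this comment, not copied from the PR; the function name `apply_location_exclusions_sketch` and the short target names `"hosp"` and `"ed"` are hypothetical.

```r
# Sketch only: a base-R stand-in for the reviewed function, reconstructed
# from the behaviour described in the comment, not taken from the PR diff.
apply_location_exclusions_sketch <- function(data,
                                             excluded_locations,
                                             supported_targets = NULL) {
  if (is.null(excluded_locations)) {
    return(data)
  }
  if (is.character(excluded_locations)) {
    excluded_locations <- list(all = excluded_locations)
  }
  if (is.null(supported_targets)) {
    # Branch 1: location-only filter; non-"all" entries are silently ignored.
    keep <- !(data$location %in% excluded_locations[["all"]])
    return(data[keep, , drop = FALSE])
  }
  # Branch 2: joint target/location filter, with validation of list names.
  bad <- setdiff(names(excluded_locations), c("all", supported_targets))
  if (length(bad) > 0) {
    stop("Unknown target name(s): ", paste(bad, collapse = ", "))
  }
  drop_row <- data$location %in% excluded_locations[["all"]]
  for (tgt in intersect(names(excluded_locations), supported_targets)) {
    drop_row <- drop_row |
      (data$target == tgt & data$location %in% excluded_locations[[tgt]])
  }
  data[!drop_row, , drop = FALSE]
}

data <- data.frame(
  target = c("hosp", "hosp", "ed", "ed"),
  location = c("AA", "BB", "AA", "BB"),
  stringsAsFactors = FALSE
)
excl <- list(all = character(0), ed = "BB")

# Same data and same exclusions; only supported_targets differs.
without_targets <- apply_location_exclusions_sketch(data, excl)
with_targets <- apply_location_exclusions_sketch(data, excl, c("hosp", "ed"))
print(nrow(without_targets))
print(nrow(with_targets))
```

With identical data and exclusions, the call without `supported_targets` keeps every row, while the call with it drops the `ed`/`BB` row. That silent difference is the opacity being flagged.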

In order to decide what the best alternative is, I think we need to understand more from @O957 about the intended use for each of the two implemented behaviors.

Collaborator Author

I pursued two separate functions:

  • apply_target_location_exclusions(data, excluded_locations, supported_targets): only filtering function; requires supported_targets; always filters on both target and location via anti-join; plain character vectors are normalized, default excludes from all targets.
  • flatten_excluded_locations(excluded_locations): data reshaping utility; collapses any exclusion (character vector or named list) into a flat character vector of unique abbreviations; no filtering, no data frame involved.

Old apply_location_exclusions() was removed, i.e. no longer a function that filters only on location while ignoring target.

For check_hospital_reporting_latency: doesn't filter a data frame, rather checks whether hospitals reported on time; needs a flat list of locations to skip, nothing w/ targets, so compute_target_webtext_values calls flatten_excluded_locations(excluded_locations) before passing to it, e.g. list("all" = "AA", "wk inc covid hosp" = "BB", "wk inc covid prop ed visits" = c("CC", "DD")) flattens to c("AA", "BB", "CC", "DD")

General pattern: rather skip a check than falsely flag; check_hospital_reporting_latency is specifically about the admissions target, so target filtering inside it seems unwarranted.
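
A minimal sketch of the flattening step, assuming it amounts to "unlist and deduplicate". The name `flatten_excluded_locations_sketch` is hypothetical, and the real function in the PR may differ in validation and edge cases.

```r
# Sketch only: collapse NULL, a character vector, or a named list of
# character vectors into one flat vector of unique abbreviations.
flatten_excluded_locations_sketch <- function(excluded_locations) {
  if (is.null(excluded_locations)) {
    return(character(0))
  }
  unique(unlist(excluded_locations, use.names = FALSE))
}

spec <- list(
  "all" = "AA",
  "wk inc covid hosp" = "BB",
  "wk inc covid prop ed visits" = c("CC", "DD")
)
flat <- flatten_excluded_locations_sketch(spec)
print(flat)
```

Note that flattening discards the target names, so every listed location ends up in the result regardless of which target it was attached to.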

Collaborator Author

In writing this, I noticed that e.g.

list(
  "all" = "VI",
  "wk inc covid prop ed visits" = "GU"
)

produces c("VI", "GU") from flatten_excluded_locations. If passed to check_hospital_reporting_latency, GU gets excluded even though it was only meant to be excluded from ed visits.

Collaborator

I think removing the excluded_locations argument from check_hospital_reporting_latency is reasonable here. This way, we always check latency for expected_locations = forecasttools::us_location_table$code. If there is a data issue with one of the locations leading it to be excluded from hub reports, having it included in the hospital reporting flag is fine (desirable)?

Collaborator

I also think this function is misnamed #188

@sbidari
Collaborator

sbidari commented Mar 16, 2026

In the current implementation, if a user wants to exclude a location, say "MA", they need to enter excluded_locations = c(hubhelpr::default_excluded_locations, "MA"), right?

Can we instead add the hubhelpr::default_excluded_locations to user supplied excluded_locations internally?
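
One way this could look, as a sketch under stated assumptions: `default_excluded` stands in for hubhelpr::default_excluded_locations (using the territories from the PR description), and `resolve_excluded_locations` is a hypothetical helper, not an existing hubhelpr function.

```r
# Hypothetical stand-in for hubhelpr::default_excluded_locations.
default_excluded <- c("VI", "GU", "AS", "MP", "UM")

# Hypothetical helper: merge user-supplied one-off exclusions with the
# defaults internally, so callers only pass the extra locations.
resolve_excluded_locations <- function(user_excluded = NULL,
                                       defaults = default_excluded) {
  union(defaults, user_excluded)
}

with_ma <- resolve_excluded_locations("MA")
defaults_only <- resolve_excluded_locations()
print(with_ma)
```

Using `union()` rather than `c()` keeps the result free of duplicates if a user repeats a location that is already a default.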

@O957
Collaborator Author

O957 commented Mar 16, 2026

I can do this (seems more sensible, now on second thought).

> Can we instead add the hubhelpr::default_excluded_locations to user supplied excluded_locations internally?

Thank you both so far for your considerations. I will wait to see if DHM has any further realizations regarding expected behavior considerations.

@dylanhmorris
Contributor

dylanhmorris commented Mar 16, 2026

I agree with @sbidari that the standard interface we want is "default exclusions + 0 or more one-off exclusions". We possibly want the additional option to turn off the default exclusions, but that should be a separate step.

A (new) idea for how to handle this. What if we remove default exclusions from hubhelpr entirely? Instead, handle both default and additional one-off user-requested exclusions in the individual downstream repositories that use hubhelpr and its actions. This has a couple potential advantages:

  • Defaults can be different for different repos. You can have a location/target pair excluded by default in one place but not in another. Yes, this means some repeating yourself. But that's worth it if we can also conceive of cases where we'd want different defaults by repo. I think there are some. For example, if a target/location pair has newly been added to a Hub, you can imagine wanting it excluded by default from the reports (i.e. a default exclude in cfa-forecast-hub-reports) but included by default in the Hub's own workflows.
  • I think it will be easier to implement an ergonomic interface for mixing default exclusions and one-off exclusions at the level of individual user-facing GitHub workflows and/or exclusion files. All the hubhelpr functions can keep their existing single excluded_locations argument.

@sbidari
Collaborator

sbidari commented Mar 16, 2026

> I agree with @sbidari that the standard interface we want is "default exclusions + 0 or more one-off exclusions". We possibly want the additional option to turn off the default exclusions, but that should be a separate step.
>
> A (new) idea for how to handle this. What if we remove default exclusions from hubhelpr entirely? Instead, handle both default and additional one-off user-requested exclusions in the individual downstream repositories that use hubhelpr and its actions. This has a couple potential advantages:
>
>   • Defaults can be different for different repos. You can have a location/target pair excluded by default in one place but not in another. Yes, this means some repeating yourself. But that's worth it if we can also conceive of cases where we'd want different defaults by repo. I think there are some. For example, if a target/location pair has newly been added to a Hub, you can imagine wanting it excluded by default from the reports (i.e. a default exclude in cfa-forecast-hub-reports) but included by default in the Hub's own workflows.
>   • I think it will be easier to implement an ergonomic interface for mixing default exclusions and one-off exclusions at the level of individual user-facing GitHub workflows and/or exclusion files. All the hubhelpr functions can keep their existing single excluded_locations argument.

This sounds good to me.

O957 added 3 commits March 18, 2026 09:22
…s-codebase' of O957:CDCgov/hubhelpr into 174-standardize-location-inclusionexclusion-logic-across-codebase
…ion-inclusionexclusion-logic-across-codebase
@O957 O957 requested a review from dylanhmorris March 18, 2026 18:31
@O957
Collaborator Author

O957 commented Mar 18, 2026

Accidentally requested re-review on this PR; it is not ready for re-review yet, apologies.

@O957
Collaborator Author

O957 commented Mar 18, 2026

PRs in covid19-forecast-hub, rsv-forecast-hub, and cfa-forecast-hub-reports coming soon.

Also: I think I got everything with Cmd-Shift-F, but let me know if you see any missed exclusion changes that need to be made.

@O957
Copy link
Copy Markdown
Collaborator Author

O957 commented Mar 18, 2026

Needed one more change in tests. Now:

> devtools::test()
ℹ Testing hubhelpr
Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr
Registered S3 method overwritten by 'epipredict':
  method            from   
  print.step_naomit recipes
✔ | F W  S  OK | Context
✔ |         20 | update_hub_target_data [1.7s]

══ Results ══════════════════════════════════════════════════════════════════
Duration: 1.7 s

[ FAIL 0 | WARN 0 | SKIP 0 | PASS 20 ]

You are a coding rockstar!

@O957
Collaborator Author

O957 commented Mar 24, 2026

Previous Situation:

apply_location_exclusions() handled location exclusions: it took a flat character vector of state/territory abbreviations and removed those from all targets uniformly. There was no way to exclude a location from just one target (e.g., wk inc covid hosp) but keep it for another (e.g., wk inc covid prop ed visits). The data functions (get_hubverse_format_nhsn_data, get_hubverse_format_nssp_data) each applied exclusions independently, before the data were combined.

Changes:

  • Adds new exclusion infrastructure (R/location_exclusions.R):

    • normalize_excluded_locations(): accepts NULL, a character vector, or a named list and normalizes to a named list format (i.e. character vectors become list("all" = ...)).
    • build_exclusion_df(): expands the named list into a target/location tibble for filtering and validates target names against the hub config.
    • flatten_excluded_locations(): collapses any exclusion specification back to a flat character vector, for situations that don't have target-level granularity (e.g., hospital reporting latency).
    • apply_target_location_exclusions(): filters a data frame by anti-joining on both target and location.
  • Removes the function apply_location_exclusions().
  • Moves exclusion filtering up into update_hub_target_data, i.e. get_hubverse_format_nhsn_data and get_hubverse_format_nssp_data no longer accept or apply excluded_locations; update_hub_target_data now filters once on the combined data (the same pattern as in write_viz_target_data).
  • Updates downstream functions w/ new NULL, character vector, named list format for excluded_locations.
  • Removes default exclusions, now expecting them to come at call-time.
  • Updates the excluded_locations description in actions/generate-viz-data/action.yaml.
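
For orientation, the normalization step described in the first bullet might look roughly like the following. This is a sketch based on the description only; `normalize_excluded_locations_sketch` is a hypothetical name and the PR's actual checks (e.g. via checkmate) are not reproduced.

```r
# Sketch only: accept NULL, a character vector, or a named list, and
# return either NULL or a named list of character vectors.
normalize_excluded_locations_sketch <- function(excluded_locations) {
  if (is.null(excluded_locations)) {
    return(NULL)
  }
  if (is.character(excluded_locations)) {
    # A plain character vector means "exclude from all targets".
    return(list(all = excluded_locations))
  }
  stopifnot(
    is.list(excluded_locations),
    !is.null(names(excluded_locations)),
    all(nzchar(names(excluded_locations))),
    all(vapply(excluded_locations, is.character, logical(1)))
  )
  excluded_locations
}

from_vector <- normalize_excluded_locations_sketch(c("VI", "GU"))
from_list <- normalize_excluded_locations_sketch(
  list(all = "VI", "wk inc covid prop ed visits" = "GU")
)
str(from_vector)
```

Normalizing once at the boundary means every downstream helper only has to handle the named-list form (or NULL).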

Workflows:

For hub repositories (covid19-forecast-hub, rsv-forecast-hub, cfa-forecast-hub-reports): exclusions specified as JSON in workflow YAML (e.g. excluded_locations: '["VI","GU","AS","MP","UM"]') and read as R list or character vector.

In one case, actions/update-target-data/action.yaml calls update_hub_target_data(); in the other, actions/generate-viz-data/action.yaml calls write_ref_date_summary_*(), write_viz_target_data(), and write_webtext().

Then, all functions accept excluded_locations as NULL, a character vector, or a named list:

  • The input is normalized via normalize_excluded_locations() (e.g. c("VI","GU") → list(all = c("VI","GU")), while list(all=..., target=...) stays the same).
  • In the case of check_hospital_reporting_latency, flatten_excluded_locations is used.
  • Otherwise, build_exclusion_df expands the list to a tibble, and then apply_target_location_exclusions() is applied.
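
The expand-then-anti-join flow above can be sketched in base R as follows. The `_sketch` function names are hypothetical, the exclusion list is assumed to be already normalized, and, unlike the PR, this sketch does not recode abbreviations to hub location codes; it assumes the data already use abbreviations.

```r
# Sketch only: expand a normalized exclusion list into target/location rows.
build_exclusion_df_sketch <- function(normalized, supported_targets) {
  unknown <- setdiff(names(normalized), c("all", supported_targets))
  if (length(unknown) > 0) {
    stop("Unknown target name(s): ", paste(unknown, collapse = ", "))
  }
  rows <- lapply(supported_targets, function(tgt) {
    # "all" exclusions apply to every target; add target-specific ones.
    locs <- as.character(unique(c(normalized[["all"]], normalized[[tgt]])))
    data.frame(
      target = rep(tgt, length(locs)),
      location = locs,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Sketch only: base-R anti-join on the (target, location) pair.
apply_target_location_exclusions_sketch <- function(data, exclusion_df) {
  key <- function(df) paste(df$target, df$location, sep = "||")
  data[!(key(data) %in% key(exclusion_df)), , drop = FALSE]
}

targets <- c("wk inc covid hosp", "wk inc covid prop ed visits")
data <- expand.grid(
  target = targets,
  location = c("VI", "GU", "MA"),
  stringsAsFactors = FALSE
)
spec <- list(all = "VI", "wk inc covid prop ed visits" = "GU")

exclusion_df <- build_exclusion_df_sketch(spec, targets)
kept <- apply_target_location_exclusions_sketch(data, exclusion_df)
print(kept)
```

Here VI is removed for both targets, while GU is removed only for the ED visits target and stays in the hospital admissions rows.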

This is used by summarize_ref_date_forecasts(), write_viz_target_data(), and update_hub_target_data(). It is not used by write_ref_date_summary[_ens|_all]() (which delegates to summarize_ref_date_forecasts()) or write_webtext() (exclusion has already happened via summarize_ref_date_forecasts).

@O957
Collaborator Author

O957 commented Mar 24, 2026

Why:

Now:

  • no hardcoded default exclusions (no exclusion or inclusion constants in hubhelpr)
  • single filtering function (apply_target_location_exclusions) handles both uniform (character vector) and target-specific (named list) exclusions.
  • filtering after, not before, data combination (now in update_hub_target_data too, as with write_viz_target_data).

Comment on lines +20 to +25
purrr::walk(excluded_locations, function(x) {
  checkmate::assert_character(
    x,
    .var.name = "excluded_locations list values"
  )
})
Collaborator

This only checks that the list values are character vectors. If you want to validate the input, you should also check that the values are valid location abbreviations (i.e. present in forecasttools::us_location_table$abbr).
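
A sketch of what such a membership check could look like. To keep it self-contained, `valid_abbrs` is a hypothetical stand-in built from base R's `state.abb` plus a few extra abbreviations; in hubhelpr the reference set would presumably be forecasttools::us_location_table$abbr, whose exact contents are not assumed here.

```r
# Hypothetical stand-in for forecasttools::us_location_table$abbr.
valid_abbrs <- c(state.abb, "DC", "PR", "VI", "GU", "AS", "MP", "UM", "US")

# Hypothetical validator: error if any value in the exclusion
# specification is not a recognised abbreviation.
assert_valid_locations <- function(excluded_locations, valid = valid_abbrs) {
  values <- unlist(excluded_locations, use.names = FALSE)
  bad <- setdiff(values, valid)
  if (length(bad) > 0) {
    stop(
      "Unrecognized location abbreviation(s): ",
      paste(bad, collapse = ", ")
    )
  }
  invisible(excluded_locations)
}

ok <- assert_valid_locations(list(all = "VI", "wk inc covid hosp" = "GU"))
bad_result <- tryCatch(
  assert_valid_locations(c("VI", "XX")),
  error = function(e) conditionMessage(e)
)
print(bad_result)
```

A check like this catches typos such as a FIPS code passed where an abbreviation is expected, which a type-only assertion lets through.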

Comment on lines +300 to +306
  supported_targets <- get_hub_supported_targets(base_hub_path)
  new_data <- apply_target_location_exclusions(
    new_data,
    excluded_locations,
    supported_targets
  )
}
Collaborator

I feel somewhat strongly that filters for target data should be based on inclusions (i.e. which locations we want to pull from the ones available upstream). One reason is that exclusion-based filtering is not robust to changes in upstream data (NHSN and NSSP). For instance, HSA regions were added to NHSN data some time ago, which broke our pipeline because we were only excluding US territories.

Something like this is needed here:

included_locations <- setdiff(
  forecasttools::us_location_table$code,
  excluded_locations
)
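
The robustness argument can be seen in a toy example. Everything here is made up for illustration: `expected` is a tiny stand-in for forecasttools::us_location_table$code, and "HSA-001" is an invented label for an unexpected upstream region.

```r
# Toy stand-in for the full table of expected location codes.
expected <- c("MA", "NY", "VI")
excluded <- "VI"

# Upstream data that has gained a location we did not anticipate.
upstream <- data.frame(
  location = c("MA", "NY", "VI", "HSA-001"),
  value = 1:4,
  stringsAsFactors = FALSE
)

# Exclusion-based filter: anything not explicitly excluded passes through.
by_exclusion <- upstream[!(upstream$location %in% excluded), , drop = FALSE]

# Inclusion-based filter: only expected, non-excluded locations pass.
included <- setdiff(expected, excluded)
by_inclusion <- upstream[upstream$location %in% included, , drop = FALSE]

print(by_exclusion$location)
print(by_inclusion$location)
```

The exclusion-based result still contains the unexpected region, whereas the inclusion-based result contains only MA and NY.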

@O957 O957 requested a review from sbidari March 24, 2026 20:33
@sbidari
Collaborator

sbidari commented Mar 24, 2026

Hi @O957 can you explain in the comments what approach you took (#178 (comment) and #178 (comment))? That will make it easier to review

O957 added 2 commits March 24, 2026 18:34
…ion-inclusionexclusion-logic-across-codebase
…ion-inclusionexclusion-logic-across-codebase
@O957
Collaborator Author

O957 commented Mar 25, 2026

Re: inclusion-based filtering for upstream data...

I made it so that functions that pull from data.cdc.gov (update_hub_target_data, write_viz_target_data when use_hub_data=FALSE) use filter_to_included_locations(), which follows the setdiff pattern you described:

included_codes <- setdiff(
  forecasttools::us_location_table$code,
  excluded_codes
)

Runs per-target (for target-specific exclusions) and keeps only rows with locations in the included set. Hub data (write_viz_target_data with use_hub_data=TRUE, summarize_ref_date_forecasts) still uses apply_target_location_exclusions since hub data is already validated.

Re: check_hospital_reporting_rate and excluded_locations

Kept excluded_locations on this instead of removing it, so it's consistent with the rest of hubhelpr; it also already did inclusion-based filtering internally (expected_locations <- setdiff(us_location_table$code, excluded_codes)). What I changed: compute_target_webtext_values now extracts only the exclusions relevant to the hosp target via get_target_exclusions(normalized, target) before it passes a flat character vector. So list("all" = "VI", "wk inc covid prop ed visits" = "GU") now works as intended (see #178 (comment)).
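
A sketch of the per-target extraction, assuming it returns the "all" exclusions plus any listed under the given target name. The call signature get_target_exclusions(normalized, target) comes from the comment above; the body below is an assumption, and the `_sketch` name is hypothetical.

```r
# Sketch only: exclusions that apply to one target are the "all" entries
# plus that target's own entries; other targets' entries are left out.
get_target_exclusions_sketch <- function(normalized, target) {
  as.character(unique(c(normalized[["all"]], normalized[[target]])))
}

spec <- list("all" = "VI", "wk inc covid prop ed visits" = "GU")

hosp_excl <- get_target_exclusions_sketch(spec, "wk inc covid hosp")
ed_excl <- get_target_exclusions_sketch(spec, "wk inc covid prop ed visits")
print(hosp_excl)
print(ed_excl)
```

Unlike flattening the whole list, this keeps GU out of the hospital admissions exclusions, which is the behaviour the earlier comment found missing.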

return(data)
}

data_targets <- unique(data$target)
Collaborator

Why not hub-supported targets here, as before?

Comment on lines +122 to +131
exclusion_df <- purrr::map_df(data_targets, \(tgt) {
  excl_abbrs <- get_target_exclusions(normalized, tgt)
  if (length(excl_abbrs) == 0) {
    return(tibble::tibble(target = character(), location = character()))
  }
  tibble::tibble(
    target = tgt,
    location = forecasttools::us_location_recode(excl_abbrs, "abbr", "hub")
  )
})
Collaborator

This will implicitly handle empty rows

Suggested change
- exclusion_df <- purrr::map_df(data_targets, \(tgt) {
-   excl_abbrs <- get_target_exclusions(normalized, tgt)
-   if (length(excl_abbrs) == 0) {
-     return(tibble::tibble(target = character(), location = character()))
-   }
-   tibble::tibble(
-     target = tgt,
-     location = forecasttools::us_location_recode(excl_abbrs, "abbr", "hub")
-   )
- })
+ exclusion_df <- dplyr::tibble(target = data_targets) |>
+   dplyr::mutate(
+     location = purrr::map(
+       target,
+       \(tgt) forecasttools::us_location_recode(
+         get_target_exclusions(normalized, tgt),
+         "abbr",
+         "hub"
+       )
+     )
+   ) |>
+   tidyr::unnest_longer(location)

Comment on lines +87 to +97
if (use_hub_data) {
  target_data <- apply_target_location_exclusions(
    target_data,
    excluded_locations
  )
} else {
  target_data <- filter_to_included_locations(
    target_data,
    excluded_locations
  )
}
Collaborator

Move this into existing if-else structure above

Comment on lines +26 to +39
if (!is.null(excluded_locations) && length(excluded_locations) > 0) {
  excluded_codes <- forecasttools::us_location_recode(
    excluded_locations,
    "abbr",
    "hub"
  )
} else {
  excluded_codes <- character(0)
}
expected_locations <- setdiff(
  forecasttools::us_location_table$code,
  excluded_codes
)

Collaborator

This doesn't work when excluded_locations is a named list. Why is excluded_locations expected to be a character vector here?

Comment on lines +155 to +176
filter_to_included_locations <- function(
  data,
  excluded_locations
) {
  normalized <- normalize_excluded_locations(excluded_locations)
  all_valid_codes <- forecasttools::us_location_table$code

  purrr::map_df(unique(data$target), \(tgt) {
    if (!is.null(normalized)) {
      excl_abbrs <- get_target_exclusions(normalized, tgt)
      excl_codes <- forecasttools::us_location_recode(
        excl_abbrs,
        "abbr",
        "hub"
      )
      included_codes <- setdiff(all_valid_codes, excl_codes)
    } else {
      included_codes <- all_valid_codes
    }
    dplyr::filter(data, .data$target == tgt, .data$location %in% included_codes)
  })
}
Collaborator

There is a mismatch here between the function name and its argument. The name of the function makes it easy to misinterpret the argument.
I'd suggest renaming the function to filter_to_expected_location and having it take expected locations (default: forecasttools::us_location_table$code) and excluded locations (default: NULL).

Then create expected_df and exclusion_df

  # All target/location pairs we expect to see
  expected_df <- tidyr::crossing(
    target = get_hub_supported_targets(),
    location = forecasttools::us_location_table$code
  )

  # Target/location pairs to exclude (empty exclusions drop out on unnest)
  exclusion_df <- dplyr::tibble(target = data_targets) |>
    dplyr::mutate(
      location = purrr::map(
        target,
        \(tgt) forecasttools::us_location_recode(
          get_target_exclusions(normalized, tgt),
          "abbr",
          "hub"
        )
      )
    ) |>
    tidyr::unnest_longer(location)

  # Expected pairs minus excluded pairs
  expected_target_location_df <- dplyr::anti_join(
    expected_df,
    exclusion_df,
    by = c("target", "location")
  )
  # Keep only rows of data that match a remaining expected pair
  filtered <- dplyr::inner_join(
    data,
    expected_target_location_df,
    by = c("target", "location")
  )

Collaborator

I don't love this, but I think it's clearer than the current approach and uses the same approach as apply_exclusion (https://github.com/CDCgov/hubhelpr/pull/178/files#r2990093161).
Open to other ideas if you have any?

@O957
Collaborator Author

O957 commented Mar 26, 2026

I have responses to some of the questions but will wait to explain in detail until I see the lay of the land once #196 is merged.

O957 added 2 commits March 27, 2026 09:31
…ion-inclusionexclusion-logic-across-codebase
…ion-inclusionexclusion-logic-across-codebase
Development

Successfully merging this pull request may close these issues.

Standardize location inclusion/exclusion logic across codebase.

4 participants