add code to harmonize redshift and platform groups#49
Conversation
lhoesly
commented
Feb 25, 2026
- Python code to align groups in Civis Platform with groups in redshift
- user supplies config of crosswalk of groups
- user supplies database
There was a problem hiding this comment.
Pull request overview
This pull request adds Python automation to synchronize user group memberships between Civis Platform and Redshift databases. The solution treats Platform groups as the source of truth and updates Redshift group memberships to match, addressing the challenge of maintaining consistent group memberships across two independent systems.
Changes:
- New Python script that fetches groups from Platform API and Redshift, compares memberships, and generates SQL ALTER GROUP statements to synchronize them
- Comprehensive README with configuration instructions and usage examples
- Configurable group crosswalk mapping and dry-run mode for safe testing
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 22 comments.
| File | Description |
|---|---|
| python/platform_redshift_groups/automate_groups.py | Main automation script that queries Platform and Redshift, compares group memberships, and generates/executes SQL to synchronize groups |
| python/platform_redshift_groups/README.md | Documentation covering overview, configuration, usage instructions, and requirements |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| redshift_valid_users_list = create_valid_database_users_list(database=database) | ||
| # Edit this list to specify users that should be ignored during group synchronization | ||
| ignore_users_list = ["dbadmin", "console"] | ||
| # ignore_groups_list = |
There was a problem hiding this comment.
Incomplete code with commented-out line. Either remove the comment or implement the intended ignore_groups_list functionality if it's needed for the feature.
| # ignore_groups_list = |
|
|
||
| for platform_group, redshift_group in group_names_crosswalk: | ||
| platform_group_members = platform_group_members_dict[platform_group] | ||
| platform_group_members = [x for x in platform_group_members if x in redshift_valid_users_list] |
There was a problem hiding this comment.
The ignore_users_list is only applied to Redshift group members (line 166) but not to Platform group members (line 163). This creates an asymmetry: if an ignored user like "dbadmin" is in a Platform group but not in the corresponding Redshift group, the script will try to add them. Consider applying the ignore list to both Platform and Redshift members for consistency.
| platform_group_members = [x for x in platform_group_members if x in redshift_valid_users_list] | |
| platform_group_members = [ | |
| x for x in platform_group_members | |
| if x in redshift_valid_users_list and x not in ignore_users_list | |
| ] |
| group_results = [client.groups.get(result['id']) for result in client.groups.list(limit = 1000)] | ||
|
|
There was a problem hiding this comment.
Hardcoded limit of 1000 groups may be insufficient for large organizations. If an organization has more than 1000 groups, this will silently fail to process all groups. Consider implementing pagination or at least logging a warning if the limit is reached.
| group_results = [client.groups.get(result['id']) for result in client.groups.list(limit = 1000)] | |
| # Retrieve all groups via pagination instead of relying on a single hardcoded limit. | |
| limit = 1000 | |
| page_num = 1 | |
| all_group_summaries = [] | |
| last_page_count = 0 | |
| while True: | |
| page = client.groups.list(page_num=page_num, limit=limit) | |
| if not page: | |
| break | |
| all_group_summaries.extend(page) | |
| last_page_count = len(page) | |
| if last_page_count < limit: | |
| # We've reached the final page. | |
| break | |
| page_num += 1 | |
| if last_page_count == limit: | |
| LOG.warning( | |
| "Retrieved a full page of %d groups from the Platform API on page %d. " | |
| "There may be additional groups that were not retrieved. " | |
| "Consider increasing the limit or using a more robust pagination strategy " | |
| "if your organization has more than %d groups.", | |
| limit, | |
| page_num, | |
| len(all_group_summaries), | |
| ) | |
| group_results = [client.groups.get(result['id']) for result in all_group_summaries] |
| def main(database, dry_run = True): | ||
| """ | ||
| Main function that produces and executes the SQL query to align the redshift and platform groups | ||
| """ | ||
|
|
||
| # constants and empty lists for the loop | ||
| full_add_list = [] | ||
| full_drop_list = [] | ||
| full_platform_group_change_list = [] | ||
| full_query_text = "" | ||
|
|
||
| # creating the dictionaries and lists we need | ||
| group_names_crosswalk = create_group_names_crosswalk() | ||
| platform_group_members_dict = create_platform_groups_dictionary() | ||
| redshift_group_members_dict = create_redshift_groups_dictionary(database=database) | ||
|
|
||
| redshift_valid_users_list = create_valid_database_users_list(database=database) | ||
| # Edit this list to specify users that should be ignored during group synchronization | ||
| ignore_users_list = ["dbadmin", "console"] | ||
| # ignore_groups_list = | ||
|
|
||
| for platform_group, redshift_group in group_names_crosswalk: | ||
| platform_group_members = platform_group_members_dict[platform_group] | ||
| platform_group_members = [x for x in platform_group_members if x in redshift_valid_users_list] | ||
|
|
||
| redshift_group_members = redshift_group_members_dict[redshift_group] | ||
| redshift_group_members = [x for x in redshift_group_members if x not in ignore_users_list] | ||
|
|
||
| to_add_to_redshift = set(platform_group_members) - set(redshift_group_members) | ||
| to_drop_from_redshift = set(redshift_group_members) - set(platform_group_members) | ||
|
|
||
| full_add_list.append((platform_group, to_add_to_redshift)) | ||
| full_drop_list.append((platform_group, to_drop_from_redshift)) | ||
|
|
||
| if to_add_to_redshift: | ||
| add_query = f"\nALTER GROUP {redshift_group} ADD USER {', '.join(to_add_to_redshift)};" | ||
| else: | ||
| add_query = f"\n--No users to add to {redshift_group}" | ||
|
|
||
| if to_drop_from_redshift: | ||
| drop_query = f"\nALTER GROUP {redshift_group} DROP USER {', '.join(to_drop_from_redshift)};" | ||
| else: | ||
| drop_query = f"\n--No users to drop from {redshift_group}" | ||
|
|
||
| full_query_text = full_query_text + f"\n\n--Platform group name: {platform_group}" + \ | ||
| f"\n--Corresponding Redshift group name: {redshift_group}" + \ | ||
| f"\n--Users to add: {add_query}" + f"\n--Users to drop: {drop_query}" | ||
|
|
||
| LOG.info(f"Full Platform Group change list: {full_platform_group_change_list}") | ||
| LOG.info(f"Full add list: {full_add_list}") | ||
| LOG.info(f"Full drop list: {full_drop_list}") | ||
|
|
||
| if dry_run: | ||
| LOG.info("Running in dry run mode. The following SQL generated but not executed:\n\n") | ||
| LOG.info(full_query_text) | ||
| else: | ||
| LOG.info("Running in full mode. The following SQL will be executed:\n\n") | ||
| LOG.info(full_query_text) | ||
| future = civis.io.query_civis(full_query_text, database = database, hidden = False) | ||
| LOG.info(future.result()) | ||
|
|
||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| # Different Platform/cloud environments use slightly different formats for Boolean parameters; | ||
| # This provides some assurance that "truthy" values are assigned properly. | ||
| DRY_RUN_PARAM = strtobool(str(os.environ.get('DRY_RUN', 'True'))) | ||
| DATABASE = os.environ.get('DATABASE') | ||
|
|
||
| if not DATABASE: | ||
| raise ValueError("DATABASE environment variable must be set") | ||
|
|
||
| main(database=DATABASE, dry_run=DRY_RUN_PARAM) No newline at end of file |
There was a problem hiding this comment.
Missing error handling for API and database failures. The script makes multiple API calls and database queries without try-except blocks. If any of these operations fail (network issues, authentication errors, database connection problems), the script will crash with an unclear error message. Add appropriate error handling with informative error messages for better operational reliability.
| return platform_group_members | ||
|
|
||
|
|
||
| def main(database, dry_run = True): |
There was a problem hiding this comment.
Inconsistent spacing around default parameter. The function definition has a space before the equals sign in dry_run = True. PEP 8 recommends no spaces around the equals sign for default parameter values. Should be dry_run=True.
| def main(database, dry_run = True): | |
| def main(database, dry_run=True): |
| else: | ||
| LOG.info("Running in full mode. The following SQL will be executed:\n\n") | ||
| LOG.info(full_query_text) | ||
| future = civis.io.query_civis(full_query_text, database = database, hidden = False) |
There was a problem hiding this comment.
Inconsistent spacing around keyword argument. The function call has a space before the equals sign in database = database. PEP 8 recommends no spaces around the equals sign for keyword arguments. Should be database=database.
|
|
||
| full_query_text = full_query_text + f"\n\n--Platform group name: {platform_group}" + \ | ||
| f"\n--Corresponding Redshift group name: {redshift_group}" + \ | ||
| f"\n--Users to add: {add_query}" + f"\n--Users to drop: {drop_query}" |
There was a problem hiding this comment.
Misleading comment labels in the generated SQL. Lines 186 say "Users to add:" and "Users to drop:" but these are followed by the actual SQL queries (or comments if there's nothing to do), not lists of users. Consider changing the labels to something like "Add query:" and "Drop query:" to better reflect what follows.
| f"\n--Users to add: {add_query}" + f"\n--Users to drop: {drop_query}" | |
| f"\n--Add query: {add_query}" + f"\n--Drop query: {drop_query}" |
| ```bash | ||
| python automate_groups.py | ||
| ``` |
There was a problem hiding this comment.
The usage instructions show running the script as python automate_groups.py but don't specify the correct path. Users would need to either be in the python/platform_redshift_groups/ directory or use the full path python python/platform_redshift_groups/automate_groups.py. Consider clarifying the working directory assumption or providing the full command.
| from distutils.util import strtobool | ||
|
|
There was a problem hiding this comment.
The distutils.util module is deprecated since Python 3.10 and was removed in Python 3.12. Replace strtobool with a custom implementation or use a different approach to parse boolean environment variables.
| from distutils.util import strtobool | |
| def strtobool(val): | |
| """ | |
| Convert a string representation of truth to True or False. | |
| This mirrors the behavior of distutils.util.strtobool, returning | |
| True for 'y', 'yes', 't', 'true', 'on', '1' and False for | |
| 'n', 'no', 'f', 'false', 'off', '0' (case-insensitive). | |
| Raises ValueError if the value is not a recognized boolean string. | |
| """ | |
| val = str(val).strip().lower() | |
| if val in ("y", "yes", "t", "true", "on", "1"): | |
| return True | |
| if val in ("n", "no", "f", "false", "off", "0"): | |
| return False | |
| raise ValueError(f"invalid truth value {val!r}") |
| # constants and empty lists for the loop | ||
| full_add_list = [] | ||
| full_drop_list = [] | ||
| full_platform_group_change_list = [] |
There was a problem hiding this comment.
Unused variable. The variable full_platform_group_change_list is initialized but never populated or used meaningfully. Remove it or implement the intended functionality.