133 changes: 133 additions & 0 deletions python/grant_schemas/README.md
@@ -0,0 +1,133 @@
# Schema-Based Granting

## Overview

Many mature databases hold thousands of tables and views, which makes data hard to organize and discover and greatly complicates decisions about sharing, granting, and access.

One remedy is to organize tables into schemas that align with different audiences and use cases, and then make and enforce sharing decisions accordingly. Even when there are exceptions, it is a significant improvement to think of dozens of schemas instead of thousands of individual tables.

However, schemas are a relatively under-supported and unintuitive database feature for sharing decisions, for a few main reasons:

- Schema and table permissions work together hierarchically in Redshift: users need **both** USAGE permission on the schema **and** SELECT permission on the tables. Granting one without the other is insufficient, which makes permission management more complex than simple folder-based sharing.
- Schema purposes cannot always be inferred from their names, so users need thorough external documentation to know where certain kinds of data should go.
- Newly created tables inherit no permissions from their schema and are accessible only to the table owner and superusers by default, regardless of which groups have access to the schema itself.
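
To make the first point concrete, here is a minimal sketch of the two statements a group needs before it can read a schema. The helper name `read_access_statements` is hypothetical and is not part of the script in this PR; the schema and group names are placeholders.

```python
def read_access_statements(schema, group):
    """Return the two grants a group needs to read a schema.

    Hypothetical helper for illustration only; names are placeholders.
    """
    return [
        # Lets the group reference objects inside the schema at all.
        f"GRANT USAGE ON SCHEMA {schema} TO GROUP {group};",
        # Lets the group read the tables and views that exist right now.
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO GROUP {group};",
    ]


for statement in read_access_statements("reporting", "analysts"):
    print(statement)
```

Neither statement covers tables created later, which is the gap described in the third point.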

## Purpose

This script automates database sharing to work more like shared folders in collaborative file sharing software, such that:

1. Individuals primarily have access to different topic schemas based on their **group** membership, with minimal cases of person-level exceptions
2. Tables put into a schema are understood to be made **automatically** available to members of those groups (not a Redshift/SQL default behavior)
3. Behavior is documented and explained directly in context, so that users are not surprised by cases where:
   - Data is **not** shared with other users as expected, or
   - Data **is** shared with other users when it was **not** expected (data leakage)

## Configuration

### Step 1: Set Environment Variables

```bash
export DATABASE="your_database_name" # Required: The name of your database
export DRY_RUN="True" # Optional: Set to False to execute changes
export GRANT_USAGE="False" # Optional: Set to True to also grant USAGE on schemas
export GRANT_FUTURE="True" # Optional: Set to False to skip future table grants (default: True)
```
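
The three optional variables arrive as strings and have to be turned into booleans. The following is a minimal sketch of that parsing, assuming the same accepted spellings as `distutils.util.strtobool`. The helper `env_flag` and its `environ` parameter are hypothetical names used only for illustration.

```python
import os

TRUE_VALUES = {"y", "yes", "t", "true", "on", "1"}
FALSE_VALUES = {"n", "no", "f", "false", "off", "0"}


def env_flag(name, default, environ=None):
    """Read a boolean flag from the environment (hypothetical helper)."""
    # Allow a plain dict to be passed in so the function is easy to try out.
    environ = os.environ if environ is None else environ
    raw = str(environ.get(name, default)).strip().lower()
    if raw in TRUE_VALUES:
        return True
    if raw in FALSE_VALUES:
        return False
    raise ValueError(f"{name} has an invalid truth value: {raw!r}")


example_env = {"DRY_RUN": "false", "GRANT_USAGE": "1"}
print(env_flag("DRY_RUN", "True", example_env))
print(env_flag("GRANT_USAGE", "False", example_env))
print(env_flag("GRANT_FUTURE", "True", example_env))  # unset, so the default applies
```

Raising on unrecognized values, rather than treating them as false, avoids silently leaving dry-run mode because of a typo.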

### Step 2: Configure Schema Grants

Edit the `SCHEMA_GRANTS_CONFIG` list in `automate_schema_grants.py` to define which groups should have access to which schemas:

```python
SCHEMA_GRANTS_CONFIG = [
    {
        'schema_name': 'reporting',
        'groups': ['analysts', 'managers', 'executives'],
        'table_creators': ['etl_user', 'data_engineer_bot'],  # Users who create tables
        'notes': 'Reporting tables for business intelligence'
    },
    {
        'schema_name': 'raw_data',
        'groups': ['data_engineers', 'etl_users'],
        'table_creators': ['etl_service_account'],
        'notes': 'Raw data ingestion schema'
    },
]
```
```
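
The script's `get_schema_grants_config()` keeps only entries that have both a `schema_name` and a non-empty `groups` list, and drops the rest without a message. A small sketch for surfacing such entries before a run follows; `skipped_entries` is a hypothetical helper, not something the script provides.

```python
def skipped_entries(config):
    """Return config entries that lack a schema name or have no groups.

    Hypothetical helper: mirrors the `if schema and groups` check in
    get_schema_grants_config(), which drops such entries silently.
    """
    return [
        entry
        for entry in config
        if not (entry.get("schema_name") and entry.get("groups", []))
    ]


example_config = [
    {"schema_name": "reporting", "groups": ["analysts"]},
    {"schema_name": "raw_data", "groups": []},  # no groups: skipped
    {"groups": ["managers"]},  # no schema name: skipped
]

for entry in skipped_entries(example_config):
    print("Skipped:", entry)
```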

**IMPORTANT - `table_creators` Configuration:**

In Redshift, `ALTER DEFAULT PRIVILEGES` only applies to objects created by **specific users**. You must list all users who might create tables in the `table_creators` field. Common users to include:
- Service accounts (e.g., `etl_service_account`, `airflow_user`)
- Application users that create tables
- Data engineers or analysts with CREATE privileges

If you omit `table_creators`, default privileges will only apply to tables created by the user running this script, meaning tables created by other users won't automatically inherit the correct permissions.
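
As a rough illustration of why the list matters, the sketch below builds one `ALTER DEFAULT PRIVILEGES` statement per creator and group, following the same pattern the script uses. The function name `default_privilege_statements` is hypothetical, and the user and group names are placeholders.

```python
def default_privilege_statements(schema, groups, table_creators):
    """Build one default-privilege statement per (creator, group) pair.

    Hypothetical helper for illustration; it follows the statement shape
    used in automate_schema_grants.py.
    """
    statements = []
    for creator in table_creators:
        for group in groups:
            statements.append(
                f"ALTER DEFAULT PRIVILEGES FOR USER {creator} IN SCHEMA {schema} "
                f"GRANT SELECT ON TABLES TO GROUP {group};"
            )
    return statements


for statement in default_privilege_statements(
    "reporting", ["analysts", "managers"], ["etl_user", "data_engineer_bot"]
):
    print(statement)
```

Two creators and two groups yield four statements. A creator missing from the list yields none, so tables that user creates later stay private to their owner.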

## Usage

### Dry Run (Preview Changes)

```bash
python automate_schema_grants.py
```

This will log the SQL GRANT statements that would be executed without actually making any changes.

### Execute Changes

```bash
export DRY_RUN="False"
python automate_schema_grants.py
```

This will execute the SQL statements to grant permissions.

### Grant USAGE Permissions

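By default, the script only grants SELECT permissions on tables. Because Redshift also requires USAGE on the schema, groups that do not already hold USAGE will still be unable to query those tables. To also grant USAGE permissions on the schemas themselves: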

```bash
export GRANT_USAGE="True"
python automate_schema_grants.py
```

### Disable Future Object Grants

By default, the script grants permissions on both existing and future tables/views. To only grant on existing objects:

```bash
export GRANT_FUTURE="False"
python automate_schema_grants.py
```

## Tables vs Views

In Redshift, the command `GRANT SELECT ON ALL TABLES IN SCHEMA` covers:
- Regular tables
- Views
- External tables
- Late-binding views

Similarly, `ALTER DEFAULT PRIVILEGES` applies to both tables and views created in the future. This means you don't need separate commands for views - they're automatically included.

## How It Works

1. **Reads Configuration**: Loads the schema-to-groups mapping from `SCHEMA_GRANTS_CONFIG`
2. **Generates GRANT Statements**: Creates SQL statements to grant SELECT (and optionally USAGE) permissions
   - Grants SELECT on all existing tables and views in each schema
   - Optionally grants USAGE on the schema itself (required for accessing tables/views)
   - Sets default privileges for future tables and views created by specified users
3. **Executes or Logs**: Either executes the changes (when `DRY_RUN=False`) or logs them for review

**Note**: In Redshift, "ALL TABLES" includes tables, views, and external tables that currently exist in the schema.

**Critical Limitation**: The `ALTER DEFAULT PRIVILEGES` command only applies to objects created by specific users. The script uses `FOR USER <username>` to grant privileges on future objects created by each user in the `table_creators` list. If a user not in this list creates a table, the permissions will **not** be automatically applied, and you'll need to either:
- Re-run this script to grant on the newly created tables
- Add that user to the `table_creators` list and re-run the script
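
One way to notice this situation early is to compare actual table owners against the configured list. The sketch below assumes you already have `(schema, table, owner)` rows from somewhere; the `pg_tables` catalog view (columns `schemaname`, `tablename`, `tableowner`) is one possible source, though that is an assumption about your setup and it lists tables only, not views. `unlisted_owners` is a hypothetical helper.

```python
def unlisted_owners(rows, table_creators):
    """Group tables by owners that are missing from table_creators.

    rows: iterable of (schema, table, owner) tuples.
    Hypothetical helper for illustration only.
    """
    listed = set(table_creators)
    missing = {}
    for schema, table, owner in rows:
        if owner not in listed:
            missing.setdefault(owner, []).append(f"{schema}.{table}")
    return missing


example_rows = [
    ("reporting", "sales", "etl_user"),
    ("reporting", "adhoc_export", "jane"),
    ("reporting", "scratch", "jane"),
]

for owner, tables in unlisted_owners(example_rows, ["etl_user"]).items():
    print(f"{owner} is not in table_creators but owns: {', '.join(tables)}")
```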

## Requirements

- Python 3.x
- `civis` Python package
- Superuser/admin access to the target database
- Appropriate Civis Platform API credentials
167 changes: 167 additions & 0 deletions python/grant_schemas/automate_schema_grants.py
@@ -0,0 +1,167 @@
"""
This script automates granting SELECT access (and optionally USAGE) on
database schemas to specified groups.
For each configured schema, it grants permissions to all associated groups
on all tables and views within that schema.

This script must be run with the credentials of an authorized superuser
account on the affected database.

Configuration:
- Set the DATABASE environment variable to specify which database to use
- Edit the SCHEMA_GRANTS_CONFIG list below to map schemas to their authorized
groups
- Set DRY_RUN=True to preview changes without executing them
- Set GRANT_USAGE=True to also grant USAGE permissions on schemas
- Set GRANT_FUTURE=True to grant permissions on future tables/views
(default: True)

IMPORTANT: ALTER DEFAULT PRIVILEGES in Redshift only applies to objects
created by specific users. You must specify the 'table_creators' list
for each schema to include all users who might create tables.
Otherwise, default privileges will only apply to tables created
by the user running this script.

Note: In Redshift, "ALL TABLES" includes tables, views, and external tables.
"""

import civis
import os
import logging
from distutils.util import strtobool

Comment on lines +31 to +32
Copilot AI Feb 25, 2026

The distutils module was deprecated in Python 3.10 and removed from the standard library in Python 3.12. Consider using a more modern alternative such as checking if the string is in a list of accepted true values (e.g., ['true', '1', 'yes']), or using a library like python-dotenv that handles environment variable parsing.

Suggested change
from distutils.util import strtobool
def strtobool(val):
    """
    Convert a string representation of truth to True or False.
    Accepted true values are: 'y', 'yes', 't', 'true', 'on', '1'.
    Accepted false values are: 'n', 'no', 'f', 'false', 'off', '0'.
    Raises ValueError if 'val' is anything else.
    """
    val = str(val).strip().lower()
    if val in ("y", "yes", "t", "true", "on", "1"):
        return True
    if val in ("n", "no", "f", "false", "off", "0"):
        return False
    raise ValueError(f"invalid truth value {val!r}")

# Setting up logging
LOG = logging.getLogger(__name__)
FORMAT = "%(asctime)-15s %(levelname)s:%(name)s.%(funcName)s:%(lineno)s %(message)s"
logging.basicConfig(level=logging.INFO, format=FORMAT)

# ========================================
# CONFIGURATION: Edit this list for your specific use case
# ========================================
SCHEMA_GRANTS_CONFIG = [
    {
        "schema_name": "example_schema",
        "groups": ["example_group_1", "example_group_2", "read_only_users"],
        "table_creators": [],  # Optional: usernames who can create tables in this schema
        "notes": "Example schema - replace with your actual schemas and groups",
    },
    # Add more schema-to-groups mappings here as needed
    # {
    #     'schema_name': 'analytics_schema',
    #     'groups': ['analysts', 'data_engineers', 'reporting_users'],
    #     'table_creators': ['etl_user', 'data_engineer_bot'],  # Users who create tables
    #     'notes': 'Analytics schema for reporting team'
    # },
]


def get_schema_grants_config():
    """
    Returns the schema grants configuration from the code-based SCHEMA_GRANTS_CONFIG.
    Returns a dict mapping schema names to their authorized groups:
    {
        'schema_name': {
            'schema': 'schema_name',
            'groups': ['group1', 'group2', ...],
            'table_creators': ['user1', 'user2', ...]
        },
        ...
    }
    """
    mapping = {}

    for config in SCHEMA_GRANTS_CONFIG:
        schema = config.get("schema_name")
        groups = config.get("groups", [])
        table_creators = config.get("table_creators", [])

        if schema and groups:
            mapping[schema] = {
                "schema": schema,
Comment on lines +79 to +80
Copilot AI Feb 25, 2026

The 'schema' field in each returned entry duplicates the key that entry is stored under. Since the dictionary is already keyed by schema name, storing 'schema': schema is redundant. Consider removing this field and using the dictionary key directly when accessing the schema name.

                "groups": tuple(groups),
Copilot AI Feb 25, 2026

Converting groups from a list to a tuple is unnecessary here. The groups are only used for iteration in the SQL generation and converting to a tuple doesn't provide any functional benefit. This could be simplified by keeping groups as a list, which would be more consistent with the input format and the table_creators field which remains a list.

Suggested change
                "groups": tuple(groups),
                "groups": groups,

I feel like 4 lines to say drop the tuple is a bit much

                "table_creators": table_creators,
            }

    return mapping


def main(database, dry_run=True, grant_usage=False, grant_future=True):
    grant_commands = []
    schema_grants = get_schema_grants_config()
Copilot AI Feb 25, 2026

The script does not validate that SCHEMA_GRANTS_CONFIG contains at least one valid configuration entry before proceeding. If the configuration is empty or all entries are invalid (missing schema or groups), the script will generate an empty query string and potentially execute it, which could be confusing. Consider adding a check after get_schema_grants_config() to ensure there is at least one valid schema configuration, and raise a clear error if not.

Suggested change
    schema_grants = get_schema_grants_config()
    schema_grants = get_schema_grants_config()
    if not schema_grants:
        LOG.error(
            "No valid schema grant configurations found. "
            "SCHEMA_GRANTS_CONFIG must contain at least one entry with a "
            "'schema_name' and a non-empty 'groups' list."
        )
        raise ValueError(
            "Invalid configuration: SCHEMA_GRANTS_CONFIG must contain at least "
            "one schema with associated groups."
        )

Not a bad idea but probably not necessary


    for schema_name in schema_grants:
        schema = schema_grants[schema_name]["schema"]
        groups = schema_grants[schema_name]["groups"]

        if grant_usage:
            usage_command = (
                f"GRANT USAGE ON SCHEMA {schema} TO GROUP {', GROUP '.join(groups)};"
            )
            grant_commands.append(usage_command)

        # Grant on existing tables and views
        # Note: In Redshift, "ALL TABLES" includes tables, views, and external tables
        select_command = f"""GRANT SELECT ON ALL TABLES IN SCHEMA {schema}
            TO GROUP {', GROUP '.join(groups)};"""
        grant_commands.append(select_command)

        # Grant on future tables and views (if enabled)
        # IMPORTANT: ALTER DEFAULT PRIVILEGES only applies to objects created by specific users
        if grant_future:
            table_creators = schema_grants[schema_name].get("table_creators", [])

            if table_creators:
                # Grant for each specified table creator
                for creator in table_creators:
                    for group in groups:
                        future_command = (
                            f"ALTER DEFAULT PRIVILEGES FOR USER {creator} IN SCHEMA {schema} "
                            f"GRANT SELECT ON TABLES TO GROUP {group};"
                        )
                        grant_commands.append(future_command)
            else:
                # No table_creators specified - grant for the current user running the script
                # This will only apply to tables created by this user!
                for group in groups:
                    future_command = (
                        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} "
                        f"GRANT SELECT ON TABLES TO GROUP {group};"
                    )
                    grant_commands.append(future_command)
                LOG.warning(
                    f"No table_creators specified for schema '{schema}'. "
                    f"Default privileges will only apply to tables created"
                    " by the user running this script."
                )

    query = "\n".join(grant_commands)

    if dry_run:
        LOG.info(
            "Running in dry run mode. The following SQL generated but not executed:\n\n"
        )
        LOG.info(query)
    else:
        LOG.info("Running in full mode. The following SQL will be executed:\n\n")
        LOG.info(query)
        future = civis.io.query_civis(query, database=database, hidden=False)
        LOG.info(future.result())
Comment on lines +147 to +148
Copilot AI Feb 25, 2026

No error handling for the query execution. If the SQL execution fails (due to insufficient permissions, invalid schema names, or database connectivity issues), the script will crash without providing actionable feedback. Consider wrapping the query execution in a try-except block and logging specific error information to help users troubleshoot issues.

Suggested change
        future = civis.io.query_civis(query, database=database, hidden=False)
        LOG.info(future.result())
        try:
            future = civis.io.query_civis(query, database=database, hidden=False)
            result = future.result()
            LOG.info(result)
        except Exception as exc:
            LOG.error(
                "Failed to execute grant commands on database '%s': %s",
                database,
                exc,
                exc_info=True,
            )
            raise



if __name__ == "__main__":
    # Different Platform/cloud environments use slightly different formats for Boolean parameters;
    # This provides some assurance that "truthy" values are assigned properly.
    DRY_RUN_PARAM = strtobool(str(os.environ.get("DRY_RUN", "True")))
    GRANT_USAGE = strtobool(str(os.environ.get("GRANT_USAGE", "False")))
    GRANT_FUTURE = strtobool(str(os.environ.get("GRANT_FUTURE", "True")))
    DATABASE = os.environ.get("DATABASE")

    if not DATABASE:
        raise ValueError("DATABASE environment variable must be set")

    main(
        database=DATABASE,
        dry_run=DRY_RUN_PARAM,
        grant_usage=GRANT_USAGE,
        grant_future=GRANT_FUTURE,
    )