
Generic data upload #51

Open
mayamkay wants to merge 13 commits into main from generic-data-upload


mayamkay commented Mar 3, 2026

This PR adds a template script, along with documentation, for generic data uploads. It is helpful when you want to let clients upload data while controlling where the data goes, how it is handled, and so on.

The code was adapted from the PMI pmipipelinetools repo.

An example script is available here: https://platform.civisanalytics.com/spa/#/scripts/containers/340695409

Copilot AI left a comment

Pull request overview

Adds a reusable/template “generic CSV upload” script for Civis Platform, intended to let clients upload data while controlling destination schema/table and notifying users on completion.

Changes:

  • Introduces generic_upload.py to determine a user’s target schema from a metadata table, upload a CSV into a target table, and optionally send a completion email.
  • Adds python/custom_file_uploads/README.md documenting setup (metadata table + Platform configuration) and usage.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| python/custom_file_uploads/generic_upload.py | New upload script: schema lookup, table drop/recreate + CSV import, email notification flow |
| python/custom_file_uploads/README.md | New documentation describing required Platform setup and configuration variables |


Comment on lines +180 to +181
full_table = f"{schema}.{table_name}"
LOG.info(f"Target table: {full_table}")
Copilot AI Mar 3, 2026

full_table is built directly from schema (read from a table) and table_name (a user-controlled parameter) and then interpolated into SQL (DROP/CREATE/INSERT paths). This allows SQL injection, or accidental writes/drops outside the intended target, if either value contains characters such as '.', ';', or quotes. Validate both as safe SQL identifiers (e.g., strict regex + disallow dots) and/or map dropdown values to a fixed allowlist of table names instead of using raw input.
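
A minimal sketch of the identifier check suggested above, assuming unquoted, ASCII-only identifiers (a letter or underscore followed by letters, digits, or underscores). The helper names `validate_identifier` and `build_full_table`, the optional allowlist argument, and the 127-character cap are illustrative assumptions, not code from this PR.

```python
import re

# Assumption: unquoted ASCII identifiers of at most 127 characters.
# Adjust the pattern to match the rules of your database.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,126}")


def validate_identifier(value, kind="identifier"):
    """Return value unchanged if it is a safe SQL identifier, else raise ValueError."""
    if not isinstance(value, str) or not _IDENTIFIER_RE.fullmatch(value):
        raise ValueError(f"Unsafe SQL {kind}: {value!r}")
    return value


def build_full_table(schema, table_name, allowed_tables=None):
    """Build 'schema.table' from validated parts, optionally enforcing an allowlist."""
    validate_identifier(schema, "schema name")
    validate_identifier(table_name, "table name")
    if allowed_tables is not None and table_name not in allowed_tables:
        raise ValueError(f"Table {table_name!r} is not in the allowlist")
    return f"{schema}.{table_name}"
```

Because the pattern has no dot, semicolon, quote, or whitespace, a value such as `other_schema.table` or `x; DROP TABLE y` is rejected before any SQL is built.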

Copilot uses AI. Check for mistakes.
Comment on lines +114 to +115


Copilot AI Mar 3, 2026

download_data_create_table catches upload exceptions, logs an error, and then returns without re-raising. This means main() will still send a success email and log "Upload process completed successfully" even when the upload failed. Re-raise the exception (or return a success flag that main() checks) so failures stop the workflow and don't trigger success notifications.

Suggested change
raise
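
The failure mode described in this comment can be sketched as follows. The function names mirror the PR, but the bodies are stand-ins: `upload` and `notify` are hypothetical callables injected for illustration and are not the script's real signatures.

```python
import logging

LOG = logging.getLogger(__name__)


def download_data_create_table(upload):
    """Run the upload; log and re-raise on failure so callers see the error."""
    try:
        upload()
    except Exception:
        LOG.exception("Upload failed")
        raise  # without this, the caller cannot tell that the upload failed


def main(upload, notify):
    """Send the success notification only if the upload did not raise."""
    download_data_create_table(upload)
    notify("Upload process completed successfully")
```

With the bare `raise` in place, an exception from `upload` propagates out of `main`, so `notify` is never reached on failure.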

database = os.environ["DATABASE"]
metadata_table = os.environ["METADATA_TABLE"]
email_address = os.getenv("EMAIL")
testing = int(os.getenv("TESTING", 0)) == 1
Copilot AI Mar 3, 2026

testing = int(os.getenv("TESTING", 0)) == 1 only works for numeric values; if the Platform boolean parameter or env var is set to "true"/"false" (a common pattern elsewhere in this repo), this will crash. Parse booleans more defensively (e.g., accept 1/0 and true/false strings).
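
One possible shape for the more defensive parsing, as a sketch. The helper name `parse_bool` and the exact sets of accepted spellings are assumptions, including the choice to treat an empty string as false.

```python
# Assumption: these spellings cover what the Platform or a shell export may pass.
_TRUE = {"1", "true", "t", "yes", "y", "on"}
_FALSE = {"0", "false", "f", "no", "n", "off", ""}


def parse_bool(raw, default=False):
    """Interpret common boolean spellings; raise ValueError on anything else."""
    if raw is None:
        return default
    text = str(raw).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"Cannot interpret {raw!r} as a boolean")
```

In the script this could replace the `int(...) == 1` expression with `testing = parse_bool(os.getenv("TESTING"))`, which accepts both `0`/`1` and `true`/`false`.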

Comment on lines +44 to +52
```bash
cd /app;
export DATABASE='redshift-general'
export TESTING=0
export EMAIL_RECIPIENTS=""
export METADATA_TABLE="metadata_data_upload_mmk"
export EMAIL_SCRIPT_ID="340695845"
python python/custom_file_uploads/generic_upload.py
```
Copilot AI Mar 3, 2026

The example container config exports EMAIL_RECIPIENTS, but the script reads EMAIL for the recipient address and EMAIL_SCRIPT_ID for the notification script. Update the example to use the same env var names the script actually consumes, otherwise users will configure this and never receive emails.

- **FILE**: File parameter (required) - Users will upload their CSV file here
- **TABLE_NAME**: Dropdown or text parameter (required) - The name of the target table
- **EMAIL**: Text parameter (optional) - Email address for notification
- **TESTING**: Boolean parameter (optional) - Set to true to skip sending emails
Copilot AI Mar 3, 2026

The README says TESTING is a boolean parameter, but the script currently expects TESTING to be numeric (0/1) when parsing the environment. Either update the README to specify 0/1, or update the script to accept "true"/"false" strings to match the documented behavior.

Suggested change
- **TESTING**: Boolean parameter (optional) - Set to true to skip sending emails
- **TESTING**: Integer parameter (optional) - Set to `1` to skip sending emails, `0` to send emails

Comment on lines +29 to +35
import civis
import pandas as pd

LOG = civis.loggers.civis_logger()


def get_schema(metadata_table: str, database: str, client=None) -> str:
Copilot AI Mar 3, 2026

get_schema is annotated as returning str, but it returns a 2-tuple (schema, user_email). Update the return type annotation (and ideally the docstring) to reflect the actual return type to avoid misleading callers/static type checks.

Suggested change
import civis
import pandas as pd
LOG = civis.loggers.civis_logger()
def get_schema(metadata_table: str, database: str, client=None) -> str:
from typing import Tuple

import civis
import pandas as pd

LOG = civis.loggers.civis_logger()


def get_schema(metadata_table: str, database: str, client=None) -> Tuple[str, str]:
    """Return the user's schema name and email address.

    Returns
    -------
    Tuple[str, str]
        A 2-tuple of (schema_name, user_email).
    """

LOG.info(f"Downloading file: {file_obj['name']}")

with tempfile.TemporaryDirectory() as tmpdir:
    tmp_path = os.path.join(tmpdir, file_obj["name"])
Copilot AI Mar 3, 2026

tmp_path is built using file_obj["name"] directly. Since file names can be user-controlled, it's safer to sanitize it (e.g., os.path.basename(...)) to avoid unexpected path separators or traversal-like behavior when writing to disk.

Suggested change
tmp_path = os.path.join(tmpdir, file_obj["name"])
safe_name = os.path.basename(file_obj["name"])
tmp_path = os.path.join(tmpdir, safe_name)
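
A slightly fuller sketch of the same idea, wrapped in a hypothetical helper named `safe_tmp_path`. The backslash normalization and the rejection of empty or dot-only names go beyond the suggestion above and are assumptions about what a stricter check might do.

```python
import os


def safe_tmp_path(tmpdir, file_name):
    """Join tmpdir with only the final path component of file_name."""
    # Normalize Windows-style separators so basename strips them on POSIX too.
    safe_name = os.path.basename(file_name.replace("\\", "/"))
    if safe_name in ("", ".", ".."):
        raise ValueError(f"Unusable file name: {file_name!r}")
    return os.path.join(tmpdir, safe_name)
```

Any directory components in the uploaded name are dropped, so the file is always written directly inside `tmpdir`.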

# Use a blank script on platform to trigger email notification
# NOTE: This requires an existing script ID that you can configure
# to send success notification emails
email_script_id = int(os.getenv("EMAIL_SCRIPT_ID"))
Copilot AI Mar 3, 2026

email_script_id = int(os.getenv("EMAIL_SCRIPT_ID")) will raise a non-obvious TypeError if the env var is missing, or a ValueError if it is empty. Prefer reading via os.environ[...] (so the KeyError names the variable) or add an explicit check that raises a clear ValueError explaining how to configure EMAIL_SCRIPT_ID.

Suggested change
email_script_id = int(os.getenv("EMAIL_SCRIPT_ID"))
email_script_id_str = os.getenv("EMAIL_SCRIPT_ID")
if not email_script_id_str:
    raise ValueError(
        "EMAIL_SCRIPT_ID environment variable must be set to a valid "
        "Civis Platform script ID in order to send notification emails."
    )
try:
    email_script_id = int(email_script_id_str)
except (TypeError, ValueError):
    raise ValueError(
        f"EMAIL_SCRIPT_ID must be an integer Civis Platform script ID, "
        f"got {email_script_id_str!r}."
    )

Comment on lines +148 to +157
client.scripts.patch_python3(
    id=email_script_id,
    name=f"Upload notification for {user_email}",
    notifications={
        "success_email_subject": email_subject,
        "success_email_body": email_body,
        "success_email_addresses": [recipient_email],
    },
)
civis.utils.run_job(email_script_id, client=client).result()
Copilot AI Mar 3, 2026

Patching a shared Platform script's notifications (client.scripts.patch_python3(...)) right before running it is race-prone: concurrent uploads can overwrite each other's notification settings, potentially emailing the wrong recipient/body. Consider a per-run notification mechanism that doesn't mutate shared script state (e.g., separate notification scripts per tenant, or avoid patching and instead configure notifications on the upload script itself).

Comment on lines +108 to +109
- **Email not received**: Check that the SCRIPT_ID points to a valid script and TESTING is set to 0

Copilot AI Mar 3, 2026

Troubleshooting references SCRIPT_ID, but the script and earlier setup steps use EMAIL_SCRIPT_ID. Rename this reference to match the actual configuration variable to avoid confusion.
