Changes to optional columns, and corresponding featurizing functions #69
Merged
poornimaramesh merged 7 commits into main on Feb 25, 2026
Pull request overview
This PR updates the CIDER featurizer to make certain columns optional, specifically for the MobileAid Kenya deployment. The changes enable the system to work with CDR and mobile money data that may lack antenna location information and balance data, which are not available in all data sources.
Changes:
- Updated CDR and mobile money schemas to make antenna IDs and balance fields optional
- Modified featurizer functions to conditionally compute features based on available columns
- Enhanced synthetic data generation to support optional columns via the keep_optional_columns parameter
- Added docstrings to schema classes and enums for improved documentation
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| src/cider/schemas.py | Changed Pydantic config from v2 to v1 style; made balance fields optional in MobileMoneyTransactionData; made recipient_id required; added docstrings |
| src/cider/featurizer/schemas.py | Changed Pydantic config from v2 to v1 style; made balance fields optional in MobileMoneyDataWithDirection |
| src/cider/featurizer/dependencies.py | Updated swap_caller_and_recipient and identify_mobile_money_transaction_direction to handle missing antenna and balance columns |
| src/cider/featurizer/core.py | Added conditional logic to skip antenna-based and balance-based features when columns are missing; added dropna calls before processing optional columns; added check for missing AntennaData |
| src/cider/utils.py | Added check_optional_columns parameter to validate_dataframe; updated _get_column_types to handle optional fields; updated synthetic data functions to support keep_optional_columns parameter |
| tests/test_utils.py | Parameterized tests to cover both keep_optional_columns=True and False scenarios |
| .pre-commit-config.yaml | Added --ignore-nested-classes flag to interrogate to accommodate nested Config classes |
Comments suppressed due to low confidence (1)
src/cider/featurizer/core.py:2638
- The featurize_cdr_data function now handles missing AntennaData by checking if it's in preprocessed_data (line 2638). However, featurize_all_data still assumes all other data types (MobileDataUsageData, MobileMoneyTransactionData, RechargeData) are always present. The preprocess_data function can skip schemas that are not in the input data_dict (see lines 2373-2379), which could cause KeyError exceptions here if those schemas are missing. Consider adding similar checks for the other schema types, or ensuring all required schemas are present before calling this function.
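One way to apply the suggested checks is to run each featurizer only when its schema is present in preprocessed_data. The sketch below is a minimal, dependency-free illustration: the function featurize_available and the lambda featurizers are hypothetical stand-ins, not the actual code in src/cider/featurizer/core.py, and plain lists stand in for dataframes.

```python
from typing import Any, Callable, Dict


def featurize_available(
    preprocessed_data: Dict[str, Any],
    featurizers: Dict[str, Callable[[Any], Dict[str, float]]],
) -> Dict[str, Dict[str, float]]:
    """Run each featurizer only if its schema is present (hypothetical helper)."""
    features: Dict[str, Dict[str, float]] = {}
    for schema_name, featurize in featurizers.items():
        if schema_name not in preprocessed_data:
            # Skip the schema instead of raising KeyError on a missing key.
            continue
        features[schema_name] = featurize(preprocessed_data[schema_name])
    return features


# Only RechargeData was preprocessed; mobile money data is absent.
preprocessed = {"RechargeData": [5.0, 10.0, 15.0]}
featurizers = {
    "RechargeData": lambda rows: {"recharge_mean": sum(rows) / len(rows)},
    "MobileMoneyTransactionData": lambda rows: {"n_transactions": float(len(rows))},
}
result = featurize_available(preprocessed, featurizers)
```

The alternative mentioned in the comment, failing early when a required schema is missing, would replace the `continue` with an explicit error that names the missing schema.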
Reviewer:
Estimate:
Ticket
Fixes: Issue 63
Description
Update optional and required columns, as part of the fixes for the MobileAid program
Changes
1. Updates for featurizing CDR data
a. Update expected raw CDR schema for MobileAid Kenya.
Required input columns:
caller_id
recipient_id
timestamp
duration
caller_antenna_id
transaction_type
b. Make featurizer functions referencing caller_antenna_id / recipient_antenna_id optional (since it may not exist in raw data).
2. Updates for featurizing mobile money data
a. Update expected raw mobile money schema for MobileAid Kenya.
Required input columns:
timestamp
caller_id
recipient_id
amount
transaction_type
b. Make featurizer functions referencing caller / recipient balance optional (since it may not exist in raw data)
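The pattern described in the changes above, validating only the required columns and computing balance features only when the column exists, can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the balance column names (caller_balance, recipient_balance) and both function names are hypothetical, and lists of dicts stand in for dataframes.

```python
from typing import Any, Dict, Iterable, List

# Required columns as listed in the description above.
MOBILE_MONEY_REQUIRED = {
    "timestamp", "caller_id", "recipient_id", "amount", "transaction_type",
}
# Hypothetical names for the optional balance columns.
MOBILE_MONEY_OPTIONAL = {"caller_balance", "recipient_balance"}


def check_columns(columns: Iterable[str], check_optional_columns: bool = False) -> None:
    """Raise if an expected column is missing; optional ones are checked on request."""
    expected = set(MOBILE_MONEY_REQUIRED)
    if check_optional_columns:
        expected |= MOBILE_MONEY_OPTIONAL
    missing = expected - set(columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")


def mobile_money_features(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute amount features always, balance features only if the column exists."""
    columns = set(rows[0]) if rows else set()
    check_columns(columns)
    features = {"total_amount": float(sum(r["amount"] for r in rows))}
    if "caller_balance" in columns:
        # Drop missing values first, analogous to a dropna call on a dataframe.
        balances = [r["caller_balance"] for r in rows if r["caller_balance"] is not None]
        if balances:
            features["mean_caller_balance"] = sum(balances) / len(balances)
    return features


base = {"timestamp": "2026-01-01", "caller_id": "a", "recipient_id": "b",
        "transaction_type": "p2p"}
rows_min = [{**base, "amount": 10.0}, {**base, "amount": 20.0}]
rows_full = [
    {**base, "amount": 10.0, "caller_balance": 100.0, "recipient_balance": 5.0},
    {**base, "amount": 20.0, "caller_balance": None, "recipient_balance": 5.0},
]
features_min = mobile_money_features(rows_min)
features_full = mobile_money_features(rows_full)
```

With the reduced input, only the amount feature is produced; with the full input, the balance feature is added and the row with a missing balance is ignored.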
How has this been tested?
I've added a new variable to the demo pipeline notebook, used when generating synthetic data: keep_optional_columns. If it is set to False, the generated data keeps only the required columns for CDR and mobile money (as defined above).
Then subsequent preprocessing and featurization steps should automatically adapt to the reduced number of columns (i.e. they should run without errors). At the end of the featurization step (step 4), with keep_optional_columns = False, there should be 741 columns. With keep_optional_columns=True, there should be 831 columns.
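As a rough illustration of what the keep_optional_columns switch does to synthetic CDR data, the sketch below generates rows and drops the optional column on request. The generator name, the value formats, and the transaction type values are assumptions made for this example; the column split follows the CDR lists in the description, and the real generators live in src/cider/utils.py.

```python
import random
from typing import Any, Dict, List

# Column split taken from the CDR section of the description above.
CDR_REQUIRED = [
    "caller_id", "recipient_id", "timestamp", "duration",
    "caller_antenna_id", "transaction_type",
]
CDR_OPTIONAL = ["recipient_antenna_id"]


def generate_synthetic_cdr(
    n_rows: int, keep_optional_columns: bool = True, seed: int = 0
) -> List[Dict[str, Any]]:
    """Generate toy CDR rows (hypothetical generator, not the CIDER one)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        row = {
            "caller_id": f"user_{rng.randrange(10)}",
            "recipient_id": f"user_{rng.randrange(10)}",
            "timestamp": f"2026-01-01T00:{i % 60:02d}:00",
            "duration": rng.randrange(1, 600),
            "caller_antenna_id": f"antenna_{rng.randrange(5)}",
            "transaction_type": rng.choice(["call", "text"]),  # assumed values
            "recipient_antenna_id": f"antenna_{rng.randrange(5)}",
        }
        if not keep_optional_columns:
            # Drop optional columns so downstream steps see the reduced schema.
            for column in CDR_OPTIONAL:
                row.pop(column)
        rows.append(row)
    return rows


reduced = generate_synthetic_cdr(3, keep_optional_columns=False)
full = generate_synthetic_cdr(3, keep_optional_columns=True)
```

Running the rest of the pipeline once on each variant is then a direct check that featurization adapts to the reduced schema.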
Also run make tests and ensure they pass.
Checklist
Fill with x for completed.
pre-commit hooks locally