This is a pipeline to predict daily and weekly mood (self-reported from PHQ surveys) from passive and active smartphone sensor features (mobility/movement amounts, communication metrics, other surveys). The data is from the Synapse.org BRIGHTEN (V1/V2) study, which can be downloaded online.
The BRIGHTEN dataset contains daily and weekly smartphone sensor data alongside clinical survey responses (PHQ-9, PHQ-2, GAD, etc.) collected across two study versions (v1, v2). The goal is to predict depression severity and change over an 8-week window.
| Notebook | Role |
|---|---|
| 01_cleaning | Raw ingestion, merging, train/test split |
| 02_Pipeline | Feature engineering, transformation, slope/intercept features |
| 02_subject-networks | Per-subject PC correlation networks |
| 03_demographic_clustering | Baseline demographic/clinical subgroup discovery |
| 04_predictive_models | Full predictive modeling: population and per-subject, PHQ-2 and PHQ-9, ANOVA/SHAP feature selection, 6-week outcome prediction |
BRIGHTEN Data:
- Baseline: clinical scales (PHQ-9, PHQ-2, GAD-7, SDS, alcohol use, sleep quality, GIC mood scale, mental health services), baseline PHQ-9, demographics, and mania screening.
- Groups: Text-intervention group, control group
- Time-varying:
- Passive data: data extracted from participants' smartphones, including movement metrics, communication metrics, weather metrics (V2 only)
- Survey data:
- Daily surveys: PHQ-2 (depression and anxiety level)
- Weekly survey: PHQ-9 (in-depth depression severity)
- Less-than-weekly surveys: sleep, mood change since start of study, stress change since start of study
Feature extraction:
- Hierarchical agglomerative clustering + PCA on variables between-persons
- Hierarchical agglomerative clustering + PCA averaging over within-person correlation structures
- Slope/intercept for 2-week chunks of variables within-person
- Baseline data PCA to define demographic + clinical groups at baseline
- Clinical vs. nonclinical identification, based on baseline clinical surveys
- Groups based on variable levels (e.g., high- vs. low-mobility groups, high- vs. low-communication groups)
- Groups based on outcome variables (e.g., depressed at baseline, depressed after X weeks at study end)
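The 2-week slope/intercept extraction can be sketched as follows. The helper name `slope_intercept_features` and the `wk…` key naming are illustrative, not the repo's actual code:

```python
import numpy as np
import pandas as pd

def slope_intercept_features(series: pd.Series, block_days: int = 14) -> dict:
    """Fit a line to each 2-week block of one subject's daily values.

    Returns {"wk1-2_slope": ..., "wk1-2_intercept": ..., ...} per block.
    Blocks with fewer than 2 non-missing days are skipped.
    """
    feats = {}
    values = series.to_numpy(dtype=float)
    for block, start in enumerate(range(0, len(values), block_days)):
        chunk = values[start:start + block_days]
        days = np.arange(len(chunk))
        mask = ~np.isnan(chunk)
        if mask.sum() < 2:
            continue
        # polyfit with deg=1 returns (slope, intercept)
        slope, intercept = np.polyfit(days[mask], chunk[mask], deg=1)
        feats[f"wk{2 * block + 1}-{2 * block + 2}_slope"] = slope
        feats[f"wk{2 * block + 1}-{2 * block + 2}_intercept"] = intercept
    return feats

# Example: 28 days of a linearly increasing daily sensor value
daily = pd.Series(np.arange(28, dtype=float))
feats = slope_intercept_features(daily)
```

Each block's slope/intercept pair then becomes one wide-format feature column per variable per block.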
Analyses:
- ANOVA and MANOVA for extracting important features
- Within-person correlation structure (subject networks) of raw variables
- Within-person correlation structure (subject networks) of PC components
- Network based statistics (see Zalesky 2010) to look for differences in subject networks between clinical and nonclinical groups
Predictive modelling:
A variety of ML structures is tested across the four dataset variants (v1_day, v2_day, v1_week, v2_week).

Input data variants:
- Using different sets of extracted features as inputs (described above)
- Using extracted features plus processed variables as inputs
Model structure variants:
- Group: trains on one set of individuals and tests on a held-out set of different individuals
  - Note: cross-validation uses GroupKFold (n_splits=5) to ensure participants are not split across train and test folds
- Individual: trains on one individual's first 4 weeks and tests on that same individual's last 2 weeks
Output prediction variants:
- Binary prediction of depressed/not depressed at study end
- Predicting an individual's daily depression score (PHQ-2 daily survey)
- Predicting an individual's weekly depression score (PHQ-9 weekly survey)
Model Evaluations:
- SHAP for feature importance within each model (averaged across 5 CVs)
- Dummy regressors: individual-level models are compared against simply predicting an individual's prior mean depression score (when past days are used to predict future days); group-level models are compared against predicting the training group's mean depression score.
- R², MAE, and product-moment (Pearson) correlation, averaged across 5 CV folds, for evaluating predicted vs. actual scores.
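The dummy-regressor comparison can be sketched as a per-group mean baseline. This is a minimal illustration; the repo's custom GroupMeanRegressor likely differs in interface:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class GroupMeanRegressor(BaseEstimator, RegressorMixin):
    """Dummy baseline: predict each subject's own training-set mean.

    Unseen subjects fall back to the global training mean. Sketch only;
    not the repo's implementation.
    """
    def fit(self, X, y, groups):
        y = np.asarray(y, dtype=float)
        groups = np.asarray(groups)
        self.global_mean_ = y.mean()
        self.means_ = {g: y[groups == g].mean() for g in np.unique(groups)}
        return self

    def predict(self, X, groups):
        groups = np.asarray(groups)
        return np.array([self.means_.get(g, self.global_mean_) for g in groups])

# A real model must beat this per-subject null to show added value
y_train = np.array([1.0, 3.0, 10.0, 12.0])
g_train = np.array(["a", "a", "b", "b"])
null = GroupMeanRegressor().fit(None, y_train, g_train)
preds = null.predict(None, np.array(["a", "b", "c"]))  # "c" unseen, gets global mean
```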
Notes on the project:
- The dataset is kept split by study version (V1/V2) and temporal granularity (daily/weekly) throughout, since V1/V2 and day/week have somewhat different sensor modalities and survey questions. Thus the core DataFrames are: v1_day, v2_day, v1_week, v2_week.
- Intermediate transformed files are read/written across steps (e.g. {name}_transformed.csv, {name}_baseFilled.csv, where name ∈ {v1_day, v2_day, v1_week, v2_week}).
- A participant-level group split prevents data leakage between train and test.
- Some predictions use all the days from one group of subjects to predict other subjects. Other predictions use the first X days of one subject to predict that subject's future days; in that case, the CV data is restricted to 'training days' and testing to 'test days'. (Individual-level training and prediction is generally limited, since most subjects have fewer than 100 days of data.)
- Binary indicator columns track missingness for passive sensor features, preserving the signal that data was absent.
- Target variables are the PHQ-2 and PHQ-9 depression scales.
GITHUB REPO:
https://github.com/kaleyjoss/smartphone_sensor_modelling/blob/main/01_cleaning.ipynb
Overview
01_cleaning.ipynb takes the raw data from Synapse.org and cleans it for the processing pipelines in the second notebook. It creates DataFrames for V1 and V2 data, separates them into _day and _week versions, imputes missing data where appropriate, drops subjects with excessive missingness, and splits the data into train and test sets.
Structure & Flow
1. Setup & Configuration
The notebook imports standard libraries and loads custom project scripts (preprocessing, visualization, variables). It defines key variable lists — column groupings for clinical scales, sensor features, and target variables (PHQ-2 and PHQ-9 depression scores).
2. Load Raw Files Reads in 15+ raw CSVs from the BRIGHTEN data directory, covering: clinical scales (PHQ-9, PHQ-2, GAD-7, SDS, alcohol use, sleep quality, GIC mood scale, mental health services), passive phone features (V1 and V2), weather data (V2), mobility/GPS data (V2), cluster entries (V2), baseline PHQ-9, demographics, and mania screening.
3. Clean & Standardize Column Names Renames columns to consistent snake_case names, creates binary/categorical summary columns (e.g., sum scores, category bins) for each clinical scale.
4. Demographics Encoding
Encodes all categorical demographic variables (gender, race, education, marital status, etc.) using LabelEncoder, saves an encoder key CSV for interpretability, and generates a participant ID-to-numeric-ID mapping.
5. Merge Into a Single DataFrame
Outer-joins all individual DataFrames on shared ID columns (participant_id, dt, week, version) into one large merged DataFrame, then saves it as raw_merged_df.csv.
6. Preprocess the Daily DF
Combines duplicate rows from the same day (via averaging, forward-fill, and back-fill), then reindexes each participant's date range to include all consecutive days — not just days with observations — and adds temporal features (day of week, month, season). Produces alldays_df. The data is split by study version (V1 vs V2) and time granularity (daily vs weekly), yielding four core DataFrames: v1_day, v2_day, v1_week, v2_week.
7. Hours Accounted For (V2 Filtering) Inspects the GPS/mobility "hours accounted for" variable in V2. Rows with fewer than 6 hours of GPS coverage are set to NaN to avoid low-quality sensor data contaminating features. V2 distance metrics are also normalized per hour accounted for.
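A minimal pandas sketch of this filtering step, with illustrative column names (`hours_accounted_for`, `distance_km`) that may differ from the actual schema:

```python
import numpy as np
import pandas as pd

# Toy V2 daily rows; column names are illustrative, not the repo's exact schema
v2_day = pd.DataFrame({
    "hours_accounted_for": [12.0, 4.0, 8.0],
    "distance_km":         [6.0,  5.0, 16.0],
})

mobility_cols = ["distance_km"]

# Mask low-coverage days (<6 h of GPS) rather than keeping noisy values
low_coverage = v2_day["hours_accounted_for"] < 6
v2_day.loc[low_coverage, mobility_cols] = np.nan

# Normalize distance per hour of observed coverage
v2_day["distance_km_per_hr"] = v2_day["distance_km"] / v2_day["hours_accounted_for"]
```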
8. Create Weekly DF Aggregates the daily data into weekly summaries, keeping only participants with at least 4 weeks of data for weekly analyses.
9. NaN Analysis & Filtering
Examines missingness per participant and per variable. Drops subjects or variables that fall below a coverage threshold. Flags certain zero-values (e.g., hours_of_sleep == 0) as likely missing rather than true zeros and converts them to NaN.
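The zero-flagging and missingness-indicator logic can be sketched like this (the `_missing` suffix is an assumed naming convention):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hours_of_sleep": [7.5, 0.0, 6.0, np.nan]})

# hours_of_sleep == 0 almost certainly means "not recorded", not zero sleep
df.loc[df["hours_of_sleep"] == 0, "hours_of_sleep"] = np.nan

# Binary indicator preserves the signal that the value was absent
df["hours_of_sleep_missing"] = df["hours_of_sleep"].isna().astype(int)
```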
10. Train/Test Split
Uses GroupShuffleSplit to split each of the four DataFrames (v1_day, v2_day, v1_week, v2_week) into train+validation and test sets, ensuring no participant appears in both sets. Saves {name}_trainval.csv and {name}_test.csv for each.
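A sketch of the participant-level split on toy data (the real code applies this to each of the four DataFrames and writes the results to CSV):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy long-format frame: several rows per participant
df = pd.DataFrame({
    "participant_id": np.repeat(["p1", "p2", "p3", "p4", "p5"], 4),
    "feature": np.arange(20),
})

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["participant_id"]))
trainval, test = df.iloc[train_idx], df.iloc[test_idx]

# No participant appears in both sets
overlap = set(trainval["participant_id"]) & set(test["participant_id"])
```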
Notes/FYI
- The dataset is kept split by study version (V1/V2) and temporal granularity (daily/weekly) throughout, since V1 and V2 have meaningfully different sensor modalities.
- A participant-level group split prevents data leakage between train and test.
- Binary indicator columns track missingness for passive sensor features, preserving the signal that data was absent.
- Target variables are the PHQ-2 and PHQ-9 depression scales.
Background
scripts/variables.py — project-specific column definitions (sensor cols, survey cols, ID cols, etc.)
The notebook expects CSV files in a directory referenced by brighten_dir, named by convention:
- {name}_trainval.csv — raw long-format data (where name ∈ {v1_day, v2_day, v1_week, v2_week})
- Intermediate transformed files are read/written across steps (e.g. {name}_transformed.csv, {name}_baseFilled.csv)
1. Distribution Analysis — Skewness and kurtosis are assessed for all numeric features to guide transformation choices.
2. Feature Transformation — A ColumnTransformer applies Yeo-Johnson or Box-Cox transforms to skewed columns, and ordinal/one-hot encoding to categorical demographics (race, gender, income, etc.).
3. Imputation & Gap Filling — Weekly columns are forward- and back-filled. Season is added as a derived feature from the date column.
4. Long → Wide Reshaping — Longitudinal data is filtered to 8 weeks and pivoted to one row per participant, aggregating daily/weekly features.
5. Target Construction — PHQ-9 change scores (continuous and categorical) are merged in as prediction targets.
6. Correlation Filtering — Highly correlated feature pairs (r > 0.7) are flagged and removed.
7. Slope/Intercept Feature Engineering — For each participant, linear regression is fit over 2-week rolling blocks of daily sensor data. The resulting slopes and intercepts are pivoted wide, creating a compact temporal summary.
8. Modeling — HistGradientBoostingRegressor and HistGradientBoostingClassifier are trained on both raw-wide and slope/intercept feature sets, evaluated via cross-validated R² and MAE.
9. SHAP Analysis — Feature importance is computed via SHAP for each model/dataset combination.
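The transformation step (step 2 above) could look roughly like the following sketch; the column names are illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder

df = pd.DataFrame({
    "sms_count": [0, 1, 2, 50, 200],       # heavily right-skewed sensor feature
    "gender":    ["f", "m", "f", "f", "m"],
})

ct = ColumnTransformer([
    # Yeo-Johnson handles zeros/negatives; Box-Cox would require strictly positive data
    ("skewed", PowerTransformer(method="yeo-johnson"), ["sms_count"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])
X = ct.fit_transform(df)   # 1 transformed column + 2 one-hot columns
```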
- Raw wide features outperform slope/intercept features alone
- Adding baseline demographic features improves performance similarly regardless of whether raw or slope/intercept features are used
- Top predictive features (from SHAP) include screen_4, device, mhs_3, phq9_9_base, screen_3, and heard_about_us
Notes/FYI
- The four dataset variants (v1_day, v2_day, v1_week, v2_week) are processed in parallel throughout — most loops iterate over all four
- Several cells appear incomplete or are placeholders — the notebook is exploratory in nature and not all cells are intended to be run sequentially
- Intermediate CSVs are written to disk between steps, so individual sections can be re-run independently
BRIGHTEN Smartphone Sensor Pipeline — Part 3: Subject-Level Feature Networks
02_subject-networks.ipynb is the third notebook in the BRIGHTEN smartphone sensor modeling pipeline, following 01_cleaning.ipynb (raw data ingestion, merging, and participant-level train/test splitting) and 02_Pipeline.ipynb (distribution analysis, feature transformation, imputation, wide reshaping, slope/intercept feature engineering, HistGradientBoosting modeling, and SHAP analysis).
Where the prior notebooks treat the feature space statistically — via correlation filtering, PCA, and aggregate SHAP importance — this notebook shifts to a subject-level, network-based representation of feature relationships. The core idea is to examine how sensor-derived principal components co-vary within individual participants over time, visualizing intra-subject feature interaction structure using signed correlation networks.
Purpose
02_Pipeline reduced raw longitudinal sensor features into principal components (PCs) capturing latent structure across behavioral modalities (sleep, mobility, social interaction, mood self-report, etc.). A natural follow-on question is: do the relationships among these PCs differ meaningfully across individuals? Depressed individuals may show qualitatively different coupling between, say, mobility and mood self-report than non-depressed individuals, even if their mean feature values look similar. Within-person averaged networks operationalize this intuition, enabling idiographic analysis.
1. Setup & Imports
Standard scientific Python stack plus a custom feature_selection module (aliased fs) from the project's scripts/ directory. The fs.plotnetwork function is the primary workhorse.
2. Load PC-Transformed Data
Reads PC-scored longitudinal DataFrames produced by 02_Pipeline — one per dataset variant (v1_day, v2_day, v1_week, v2_week). Each row is a participant-day or participant-week observation; columns are extracted PC components (e.g., pc_sleep_mood_mhs, pc_phq2, pc_calls, pc_mobility, pc_mobility-radius, pc_missed-interactions, pc_texts, pc_lieAwake, pc_morning-interac).
3. Per-Subject Network Construction
For each participant, a signed correlation (or partial correlation) network is computed over their longitudinal PC time series. Each PC component becomes a node; edge weights = pairwise temporal co-variation for that individual. This is computed in a loop over subject IDs, with a filter for participants with sufficient longitudinal coverage to produce stable estimates (consistent with the ≥4-week minimum used in 02_Pipeline).
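A minimal sketch of the per-subject network construction, using toy PC columns and a roughly 4-week coverage filter (the repo's fs.plotnetwork handles the plotting):

```python
import numpy as np
import pandas as pd

# Toy PC-scored long-format data: one row per participant-day
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "participant_id": np.repeat([1, 2], 30),
    "pc_mobility": rng.normal(size=60),
    "pc_phq2": rng.normal(size=60),
    "pc_texts": rng.normal(size=60),
})
pc_cols = ["pc_mobility", "pc_phq2", "pc_texts"]

networks = {}
for sub, sub_df in df.groupby("participant_id"):
    if len(sub_df) < 28:          # require ~4 weeks of coverage for stable estimates
        continue
    # Signed Pearson correlation matrix = adjacency matrix of this subject's network
    networks[sub] = sub_df[pc_cols].corr()
```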
4. Network Visualization via fs.plotnetwork
Each subject's network is rendered by fs.plotnetwork with signed edge coloring — green edges indicate positive temporal co-variation, red edges indicate negative co-variation. Plot titles encode dataset variant and participant ID (e.g., v1_day: sub 2074.0).
5. (Downstream) Network-Level Feature Extraction
Subject networks can be summarized into graph-theoretic descriptors — edge density, signed modularity, hub nodes, mean edge weight — for use as additional features in the depression prediction models from 02_Pipeline, or as a standalone analysis of individual differences in behavioral feature coupling as a function of depression trajectory.
| File | Description |
|---|---|
| {name}_transformed.csv (or equivalent PC-scored output) | Long-format participant × time matrices with PC components as columns, produced by 02_Pipeline.ipynb |
Dataset variants: v1_day, v2_day, v1_week, v2_week
The PC components visualized here are the same latent dimensions identified as predictively relevant in 02_Pipeline (e.g., pc_sleep_mood_mhs, pc_phq2, pc_mobility). This notebook provides an individualized, temporal-relational lens on features that the modeling notebook treats in aggregate.
Notes
- Visualizations are cleared from the committed notebook; re-run all cells to regenerate plots.
- Network estimation is per-subject — participants with sparse longitudinal coverage will produce unstable estimates. Apply the same ≥4-week minimum used upstream.
- V1 and V2 have different PC compositions due to different sensor modalities (V2 includes GPS/mobility features absent from V1); networks should not be compared directly across versions.
- The notebook is exploratory in nature, consistent with the conventions of the broader pipeline.
Example Within-subject PC correlation networks
This is the fourth notebook in the BRIGHTEN pipeline, sitting after 01_cleaning → 02_Pipeline → 02_subject-networks. Where prior notebooks focus on sensor features and temporal modeling, this one turns to static baseline characteristics (demographics and symptoms) and investigates whether clinically or demographically meaningful subgroups exist in the study sample. Prior notebooks work with longitudinal sensor data, but this notebook works with baseline-only snapshots: one row per participant.
The goal is to discover whether the BRIGHTEN sample contains meaningful clinical subgroups — e.g., a high-anxiety, low-depression cluster vs. a psychosis-risk cluster vs. a mild-symptoms cluster — that might moderate the relationship between sensor features and depression outcomes found in 02_Pipeline. Cluster labels could feed back as stratification variables for the predictive models, or serve as a standalone characterization of who participates in digital mental health research.
Rather than reading the longitudinal _trainval.csv files, this notebook reads a specific set of baseline CSVs directly from BRIGHTEN_data/:
- PHQ-9 - Baseline.csv — baseline depression severity (9-item scale)
- Baseline Demographics.csv — gender, race, education, marital status, etc.
- IMPACT Mania and Psychosis Screening.csv — screening for bipolar/psychosis symptoms
- Alcohol.csv — alcohol use
- GAD - Anxiety.csv — baseline anxiety severity (7-item scale)
- Mental Health Services.csv — current service utilization
- id_key.csv — maps participant_id to anonymized num_id
- Cluster variation for each demographic variable
- Cluster profiles
1. Setup & Imports
Loads the standard scientific Python stack plus three custom project scripts: preprocessing, visualization, and clustering (aliased cl). The clustering module is unique to this notebook and wraps KMeans, agglomerative clustering, silhouette scoring, and dendrogram utilities, which are also imported directly from sklearn/scipy.
2. Data Loading — Baseline Clinical & Demographic Tables
3. Cleaning & Feature Construction
Each loaded DataFrame is passed through a consistent cleaning loop that:
- Merges in the id_key and drops participant_id to preserve anonymity
- Drops irrelevant or redundant columns (dt, study, cohort, device, heard_about_us, etc.)
- Computes composite sum scores where applicable:
  - phq9_sum_base — sum of all 9 PHQ-9 baseline items
  - mhs_sum — sum of mental health service use items
  - bipolar — sum of screen_2 + screen_3 (bipolar screening)
  - scz — sum of screen_1 + screen_4 (psychosis screening)
4. Merging into a Single Baseline DataFrame
The notebook reads alldays_df.csv (produced by 01_cleaning) to get the full participant ID list, then merges the six baseline tables on num_id to build one wide participant-level matrix. This mirrors the logic in 01_cleaning but scoped to baseline-only features.
5. Clustering Analysis
Using this baseline feature matrix, the notebook applies:
- KMeans clustering — partitions participants into k groups based on Euclidean distance in the baseline feature space; k is likely chosen via silhouette score evaluation
- Agglomerative (hierarchical) clustering — builds a dendrogram over pairwise distances (pdist, squareform, linkage), enabling visualization of nested cluster structure
- Silhouette scoring — quantifies cluster quality; used to select optimal k for KMeans
- Dendrogram visualization — renders the hierarchical merge tree to identify natural groupings
Categorical demographics are encoded via LabelEncoder before clustering (consistent with 01_cleaning).
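The silhouette-guided choice of k can be sketched as follows, using synthetic blobs in place of the baseline feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy baseline feature matrix with 3 well-separated subgroups
X, _ = make_blobs(n_samples=120, centers=3, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # higher = tighter, better-separated clusters

best_k = max(scores, key=scores.get)
```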
This is the fifth notebook in the BRIGHTEN pipeline, following 01_cleaning → 02_Pipeline → 02_subject-networks → 03_demographic_clustering.
The primary goal is to predict PHQ-2 (daily depression screening) and PHQ-9 (weekly depression severity) scores from passive smartphone sensor features, self-report surveys, and baseline demographic/clinical data. The notebook explores this through several progressively refined modeling strategies: cross-subject population models, per-subject idiographic models, feature selection via ANOVA and correlation, SHAP-based feature importance, and slope/intercept temporal feature engineering.
CSV files from BRIGHTEN_data/, produced by 02_Pipeline.ipynb:
| File | Description |
|---|---|
| {name}_trainval_transformed.csv | Preprocessed, transformed long-format data for train/validation |
| {name}_test_transformed.csv | Held-out test set |
| {name}_trainval_nonskew.csv | Non-skew-transformed version, used for ANOVA feature selection |
| {name}_wide_slopeintercept_2wk_outcomes.csv | Wide-format slope/intercept features + 6-week outcome targets |
Dataset variants: v1_day, v2_day, v1_week, v2_week
| Model | Use Case |
|---|---|
| HistGradientBoostingRegressor | Primary regression model for PHQ-2/PHQ-9 scores |
| HistGradientBoostingClassifier | Classification variant for binary/categorical depression outcomes |
| GroupMeanRegressor (custom) | Dummy baseline: predicts each participant's own mean score |
| RandomForestRegressor, XGBRegressor, Ridge | Additional regressors defined but used selectively |
All regression models are evaluated via R² and MAE; cross-validation uses GroupKFold (n=5) to ensure participants are not split across train and test folds.
1. Setup & Imports
2. ANOVA Feature Selection (All Subjects)
For each subject individually, ANOVA F-scores (SelectKBest, f_classif) are computed between all sensor features and the target variable (PHQ-2 or PHQ-9). The top 10 features per subject are saved, and the most frequently top-ranked features across subjects are aggregated — yielding a population-level ranking of which features are most consistently informative for an individual subject. Results are visualized as frequency bar charts and saved to anova_features.
3. Population Model — All Features, PHQ-2
A HistGradientBoostingRegressor is trained on all available features (excluding PHQ columns) using a GroupShuffleSplit 80/20 participant-level split. Models are trained with max_iter ∈ {50, 100, 200} and evaluated by comparing predicted vs. actual average PHQ-2 over time. Permutation feature importance is computed on the test set for the best model and saved as a CSV.
4. Population Model — Correlation-Filtered Features, PHQ-2
Repeats the above but first filters to only features with |r| > 0.1 with the PHQ-2 target, reducing dimensionality before modeling. Permutation importance is again saved.
5. Per-Subject Model — Top Correlated Features
For each participant individually (filtered to ≥20 observations), the top 10 features most correlated with their own PHQ-2 scores are selected and used to fit a subject-specific HistGradientBoostingRegressor. The last 20% of each subject's days are held out as a test set (temporal split, not random), and predicted vs. actual PHQ-2 is plotted per subject.
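The temporal 80/20 per-subject split can be sketched as:

```python
import numpy as np
import pandas as pd

# One subject's days, already sorted by date
sub_df = pd.DataFrame({"day": pd.date_range("2016-01-01", periods=50),
                       "phq2_sum": np.arange(50)})

if len(sub_df) >= 20:                    # coverage filter from the text
    cut = int(len(sub_df) * 0.8)         # first 80% of days train, last 20% test
    train, test = sub_df.iloc[:cut], sub_df.iloc[cut:]
```

Unlike a random split, this keeps the test days strictly after the training days, mimicking prospective prediction.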
6. Per-Subject Model — ANOVA Top 15 Features, PHQ-2
Same per-subject approach, but feature selection uses the population-level ANOVA top features (from Step 2) rather than each subject's own correlations. R² from the model is compared against a naive baseline (predicting the training-period mean) for each subject, and the improvement is saved as {name}_{y_col}_r2_model_increase_from_avg.csv.
7. Per-Subject Model — Subject-Specific ANOVA Top 10, PHQ-2
Repeats Step 6 but uses each subject's own top ANOVA features (from top_features[y_col][name][sub]) rather than the population aggregate. Both approaches are compared to quantify whether personalized feature selection adds value over population-level selection.
8. Population Model — PHQ-9 from Weekly Data
Switches target to phq9_sum using v1_week and v2_week. Uses the same ANOVA top-15 feature set and temporal per-subject splits. R² vs. baseline is tracked and saved per dataset variant.
9. HistGradientBoosting with Full Feature Sets — PHQ-9
Trains HistGradientBoostingRegressor on v1_week/v2_week using GroupKFold cross-validation across three feature configurations:
- baseline: demographic and clinical baseline features only
- 8wks: passive sensor features collected during the 8-week study (no baseline)
- both: all features combined
Reports mean R² and MAE per configuration. Also predicts on a held-out validation set. Results and trained models are stored in model_dict for downstream SHAP analysis.
10. SHAP Analysis — PHQ-9 Models
For each model/dataset/feature-set combination, SHAP values are computed using shap.Explainer with the first CV fold's model and test set. Bar plots of mean absolute SHAP values are displayed, and memory is explicitly freed between iterations with gc.collect().
11. Per-Subject Prediction Plots — PHQ-9
Predicted vs. actual PHQ-9 scores are plotted per subject using Plotly line charts, with per-subject R² and Pearson correlation displayed in the title. Results are aggregated into subject_phq9_preds_from_diff_models.csv.
12. Same Pipeline — Excluding All PHQ/Survey Variables (PHQ-9)
Repeats the full GroupKFold + SHAP + per-subject plotting pipeline for PHQ-9, but drops PHQ-2, SDS, stress, support, mood (MHS) columns — testing whether a model built on purely passive sensor data (no self-report survey items) retains predictive power. MHS items are summed into mhs_sum before dropping the individual items.
13. PHQ-2 Sum Prediction — Daily Data
Mirrors the PHQ-9 pipeline (steps 9–12) but targets phq2_sum using v1_day/v2_day, drops all PHQ-9 and survey-derived variables, and saves per-subject scores to subject_phq2_preds_from_diff_models.csv.
14. Subject Score Visualizations
Reads saved subject-level score CSVs and generates Plotly bar charts of R² and Pearson correlation per subject, colored by dataset variant, for both 8wks and both feature configurations.
15. Slope/Intercept Feature Modeling — Predicting 6-Week Outcomes
Loads the wide-format slope/intercept DataFrames (engineered in 02_Pipeline), and trains HistGradientBoostingRegressor/Classifier models to predict three 6-week outcomes: phq9_sum_6wks, 6wks_depressed_binary, and depression_change_bin. Feature sets are restricted to the first 2 weekly blocks (weeks 1–2 of data) to simulate early prediction. Results are stored in model_dict_slopeint.
16. SHAP for Slope/Intercept Models
SHAP analysis on the slope/intercept models, using shap.TreeExplainer for regressors and predict_proba-based explainers for classifiers. Top 20 features by mean absolute SHAP value are stored in top15features for use in the next step.
17. Reduced Feature Modeling — Top 15 SHAP Features
Re-trains slope/intercept models using only the top 15 SHAP-identified features per target/dataset combination, evaluating whether feature reduction (from 150–300 features down to 15) preserves predictive performance.
| File | Description |
|---|---|
| {name}_feature_importances_maxIter{n}.csv | Permutation feature importances from population HGBT models |
| {name}_{y_col}_r2_model_increase_from_avg.csv | Per-subject R² improvement over naive mean baseline (ANOVA features) |
| {name}_{y_col}_sub-specific_r2_model_increase_from_avg.csv | Same, using subject-specific ANOVA feature selection |
| subject_phq9_preds_from_diff_models.csv | Per-subject R² and Pearson r for PHQ-9 models |
| subject_phq2_preds_from_diff_models.csv | Per-subject R² and Pearson r for PHQ-2 models |
- A temporal train/test split (first 80% of days → train, last 20% → test) is used for per-subject models, rather than random splitting, to simulate real-world prospective prediction.
- A custom GroupMeanRegressor serves as the within-person null: it predicts each subject's own training-period mean, which is a much stricter baseline than a global mean for longitudinal mental health data.
- Models are evaluated across three distinct feature regimes (baseline-only, sensor-only, combined) to isolate the marginal contribution of passive sensing beyond what is explained by clinical demographics.
- The notebook is exploratory and modular — sections can be run independently given the right intermediate CSVs on disk. Not all cells are designed for sequential execution.
- V1 and V2 data are processed in parallel throughout; results should not be pooled across versions due to different sensor modalities.




