This is a pipeline to predict daily and weekly mood (self-reported from PHQ surveys) from passive and active smartphone sensor features (mobility/movement amounts, communication metrics, other surveys). The data is from the Synapse.org BRIGHTEN (V1/V2) study, which can be downloaded online.
The BRIGHTEN dataset contains daily and weekly smartphone sensor data alongside clinical survey responses (PHQ-9, PHQ-2, GAD, etc.) collected across two study versions (v1, v2). The goal is to predict depression severity and change over an 8-week window.
| Notebook | Role |
|---|---|
| 01_cleaning | Raw ingestion, merging, train/test split |
| 02_Pipeline | Feature engineering, transformation, slope/intercept features |
| 02_subject-networks | Per-subject PC correlation networks |
| 03_demographic_clustering | Baseline demographic/clinical subgroup discovery |
| 04_predictive_models | Full predictive modeling: population and per-subject, PHQ-2 and PHQ-9, ANOVA/SHAP feature selection, 6-week outcome prediction |
BRIGHTEN Data:
- Baseline: clinical scales (PHQ-9, PHQ-2, GAD-7, SDS, alcohol use, sleep quality, GIC mood scale, mental health services), baseline PHQ-9, demographics, and mania screening.
- Groups: Text-intervention group, control group
- Time-varying:
- Passive data: data extracted from participants' smartphones, including movement metrics, communication metrics, weather metrics (V2 only)
- Survey data:
- Daily surveys: PHQ-2 (depression and anxiety level)
- Weekly survey: PHQ-9 (in-depth depression severity)
- Less-than-weekly surveys: sleep, mood change since start of study, stress change since start of study
Feature extraction:
- Hierarchical agglomerative clustering + PCA on variables between-persons
- Hierarchical agglomerative clustering + PCA averaging over within-person correlation structures
- Slope/intercept for 2-week chunks of variables within-person
- Baseline data PCA to define demographic + clinical groups at baseline
- Clinical vs. nonclinical identification, based on baseline clinical surveys
- Groups based on variable levels (e.g., high- vs. low-mobility groups, high- vs. low-communication groups)
- Groups based on outcome variables (e.g., depressed at baseline, depressed after X weeks at study end)
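The 2-week slope/intercept extraction can be sketched as follows. The helper name `slope_intercept_features` and the `wk…` key naming are illustrative, not the repo's actual code:

```python
import numpy as np
import pandas as pd

def slope_intercept_features(series: pd.Series, block_days: int = 14) -> dict:
    """Fit a line to each 2-week block of one subject's daily values.

    Returns {"wk1-2_slope": ..., "wk1-2_intercept": ..., ...} per block.
    Blocks with fewer than 2 non-missing days are skipped.
    """
    feats = {}
    values = series.to_numpy(dtype=float)
    for block, start in enumerate(range(0, len(values), block_days)):
        chunk = values[start:start + block_days]
        days = np.arange(len(chunk))
        mask = ~np.isnan(chunk)
        if mask.sum() < 2:
            continue
        # polyfit with deg=1 returns (slope, intercept)
        slope, intercept = np.polyfit(days[mask], chunk[mask], deg=1)
        feats[f"wk{2 * block + 1}-{2 * block + 2}_slope"] = slope
        feats[f"wk{2 * block + 1}-{2 * block + 2}_intercept"] = intercept
    return feats

# Example: 28 days of a linearly increasing daily sensor value
daily = pd.Series(np.arange(28, dtype=float))
feats = slope_intercept_features(daily)
```

Each block's slope/intercept pair then becomes one wide-format feature column per variable per block.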
Analyses:
- ANOVA and MANOVA for extracting important features
- Within-person correlation structure (subject networks) of raw variables
- Within-person correlation structure (subject networks) of PC components
- Network based statistics (see Zalesky 2010) to look for differences in subject networks between clinical and nonclinical groups
Predictive modelling:
A variety of ML structures is tested across the four dataset variants (v1_day, v2_day, v1_week, v2_week).

Input data variants:
- Using different sets of extracted features as inputs (described above)
- Using extracted features plus processed variables as inputs
Model structure variants:
- Group: trains on one set of individuals and tests on a held-out set of different individuals
  - Note: cross-validation uses GroupKFold (n_splits=5) to ensure participants are not split across train and test folds
- Individual: trains on one individual's first 4 weeks and tests on that same individual's last 2 weeks
Output prediction variants:
- Binary prediction of depressed/not depressed at study end
- Predicting an individual's daily depression score (PHQ-2 daily survey)
- Predicting an individual's weekly depression score (PHQ-9 weekly survey)
Model Evaluations:
- SHAP for feature importance within each model (averaged across 5 CVs)
- Dummy regressors: individual-level models are compared against simply predicting an individual's prior mean depression score (when past days are used to predict future days); group-level models are compared against predicting the training group's mean depression score.
- R², MAE, and product-moment (Pearson) correlation, averaged across 5 CV folds, for evaluating predicted vs. actual scores.
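The dummy-regressor comparison can be sketched as a per-group mean baseline. This is a minimal illustration; the repo's custom GroupMeanRegressor likely differs in interface:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class GroupMeanRegressor(BaseEstimator, RegressorMixin):
    """Dummy baseline: predict each subject's own training-set mean.

    Unseen subjects fall back to the global training mean. Sketch only;
    not the repo's implementation.
    """
    def fit(self, X, y, groups):
        y = np.asarray(y, dtype=float)
        groups = np.asarray(groups)
        self.global_mean_ = y.mean()
        self.means_ = {g: y[groups == g].mean() for g in np.unique(groups)}
        return self

    def predict(self, X, groups):
        groups = np.asarray(groups)
        return np.array([self.means_.get(g, self.global_mean_) for g in groups])

# A real model must beat this per-subject null to show added value
y_train = np.array([1.0, 3.0, 10.0, 12.0])
g_train = np.array(["a", "a", "b", "b"])
null = GroupMeanRegressor().fit(None, y_train, g_train)
preds = null.predict(None, np.array(["a", "b", "c"]))  # "c" unseen, gets global mean
```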
Notes on the project:
- The dataset is kept split by study version (V1/V2) and temporal granularity (daily/weekly) throughout, since V1/V2 and day/week have somewhat different sensor modalities and survey questions. Thus the core DataFrames are: v1_day, v2_day, v1_week, v2_week.
- Intermediate transformed files are read/written across steps (e.g. {name}_transformed.csv, {name}_baseFilled.csv, where name ∈ {v1_day, v2_day, v1_week, v2_week}).
- A participant-level group split prevents data leakage between train and test.
- Some predictions use all the days from one group of subjects to predict other subjects. Other predictions use the first X days of one subject to predict that subject's future days; in that case, the CV data is restricted to 'training days' and testing to 'test days'. (Individual-level training and prediction is generally limited, since most subjects have fewer than 100 days of data.)
- Binary indicator columns track missingness for passive sensor features, preserving the signal that data was absent.
- Target variables are the PHQ-2 and PHQ-9 depression scales.
GITHUB REPO:
https://github.com/kaleyjoss/smartphone_sensor_modelling/blob/main/01_cleaning.ipynb
Overview
01_cleaning.ipynb takes the raw data from Synapse.org and cleans it for the processing pipelines in the second notebook. It creates DataFrames for V1 and V2 data, separates them into _day and _week versions, imputes missing data where appropriate, drops subjects with excessive missingness, and splits the data into train and test sets.
Structure & Flow
1. Setup & Configuration
The notebook imports standard libraries and loads custom project scripts (preprocessing, visualization, variables). It defines key variable lists — column groupings for clinical scales, sensor features, and target variables (PHQ-2 and PHQ-9 depression scores).
2. Load Raw Files Reads in 15+ raw CSVs from the BRIGHTEN data directory, covering: clinical scales (PHQ-9, PHQ-2, GAD-7, SDS, alcohol use, sleep quality, GIC mood scale, mental health services), passive phone features (V1 and V2), weather data (V2), mobility/GPS data (V2), cluster entries (V2), baseline PHQ-9, demographics, and mania screening.
3. Clean & Standardize Column Names Renames columns to consistent snake_case names, creates binary/categorical summary columns (e.g., sum scores, category bins) for each clinical scale.
4. Demographics Encoding
Encodes all categorical demographic variables (gender, race, education, marital status, etc.) using LabelEncoder, saves an encoder key CSV for interpretability, and generates a participant ID-to-numeric-ID mapping.
5. Merge Into a Single DataFrame
Outer-joins all individual DataFrames on shared ID columns (participant_id, dt, week, version) into one large merged DataFrame, then saves it as raw_merged_df.csv.
6. Preprocess the Daily DF
Combines duplicate rows from the same day (via averaging, forward-fill, and back-fill), then reindexes each participant's date range to include all consecutive days — not just days with observations — and adds temporal features (day of week, month, season). Produces alldays_df. The data is split by study version (V1 vs V2) and time granularity (daily vs weekly), yielding four core DataFrames: v1_day, v2_day, v1_week, v2_week.
7. Hours Accounted For (V2 Filtering) Inspects the GPS/mobility "hours accounted for" variable in V2. Rows with fewer than 6 hours of GPS coverage are set to NaN to avoid low-quality sensor data contaminating features. V2 distance metrics are also normalized per hour accounted for.
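A minimal pandas sketch of this filtering step, with illustrative column names (`hours_accounted_for`, `distance_km`) that may differ from the actual schema:

```python
import numpy as np
import pandas as pd

# Toy V2 daily rows; column names are illustrative, not the repo's exact schema
v2_day = pd.DataFrame({
    "hours_accounted_for": [12.0, 4.0, 8.0],
    "distance_km":         [6.0,  5.0, 16.0],
})

mobility_cols = ["distance_km"]

# Mask low-coverage days (<6 h of GPS) rather than keeping noisy values
low_coverage = v2_day["hours_accounted_for"] < 6
v2_day.loc[low_coverage, mobility_cols] = np.nan

# Normalize distance per hour of observed coverage
v2_day["distance_km_per_hr"] = v2_day["distance_km"] / v2_day["hours_accounted_for"]
```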
8. Create Weekly DF Aggregates the daily data into weekly summaries, keeping only participants with at least 4 weeks of data for weekly analyses.
9. NaN Analysis & Filtering
Examines missingness per participant and per variable. Drops subjects or variables that fall below a coverage threshold. Flags certain zero-values (e.g., hours_of_sleep == 0) as likely missing rather than true zeros and converts them to NaN.
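The zero-flagging and missingness-indicator logic can be sketched like this (the `_missing` suffix is an assumed naming convention):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hours_of_sleep": [7.5, 0.0, 6.0, np.nan]})

# hours_of_sleep == 0 almost certainly means "not recorded", not zero sleep
df.loc[df["hours_of_sleep"] == 0, "hours_of_sleep"] = np.nan

# Binary indicator preserves the signal that the value was absent
df["hours_of_sleep_missing"] = df["hours_of_sleep"].isna().astype(int)
```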
10. Train/Test Split
Uses GroupShuffleSplit to split each of the four DataFrames (v1_day, v2_day, v1_week, v2_week) into train+validation and test sets, ensuring no participant appears in both sets. Saves {name}_trainval.csv and {name}_test.csv for each.
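A sketch of the participant-level split on toy data (the real code applies this to each of the four DataFrames and writes the results to CSV):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy long-format frame: several rows per participant
df = pd.DataFrame({
    "participant_id": np.repeat(["p1", "p2", "p3", "p4", "p5"], 4),
    "feature": np.arange(20),
})

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["participant_id"]))
trainval, test = df.iloc[train_idx], df.iloc[test_idx]

# No participant appears in both sets
overlap = set(trainval["participant_id"]) & set(test["participant_id"])
```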
Notes/FYI
- The dataset is kept split by study version (V1/V2) and temporal granularity (daily/weekly) throughout, since V1 and V2 have meaningfully different sensor modalities.
- A participant-level group split prevents data leakage between train and test.
- Binary indicator columns track missingness for passive sensor features, preserving the signal that data was absent.
- Target variables are the PHQ-2 and PHQ-9 depression scales.
Background
scripts/variables.py — project-specific column definitions (sensor cols, survey cols, ID cols, etc.)
The notebook expects CSV files in a directory referenced by brighten_dir, named by convention:
- {name}_trainval.csv — raw long-format data (where name ∈ {v1_day, v2_day, v1_week, v2_week})
- Intermediate transformed files are read/written across steps (e.g. {name}_transformed.csv, {name}_baseFilled.csv)
1. Distribution Analysis — Skewness and kurtosis are assessed for all numeric features to guide transformation choices.
2. Feature Transformation — A ColumnTransformer applies Yeo-Johnson or Box-Cox transforms to skewed columns, and ordinal/one-hot encoding to categorical demographics (race, gender, income, etc.).
3. Imputation & Gap Filling — Weekly columns are forward- and back-filled. Season is added as a derived feature from the date column.
4. Long → Wide Reshaping — Longitudinal data is filtered to 8 weeks and pivoted to one row per participant, aggregating daily/weekly features.
5. Target Construction — PHQ-9 change scores (continuous and categorical) are merged in as prediction targets.
6. Correlation Filtering — Highly correlated feature pairs (r > 0.7) are flagged and removed.
7. Slope/Intercept Feature Engineering — For each participant, linear regression is fit over 2-week rolling blocks of daily sensor data. The resulting slopes and intercepts are pivoted wide, creating a compact temporal summary.
8. Modeling — HistGradientBoostingRegressor and HistGradientBoostingClassifier are trained on both raw-wide and slope/intercept feature sets, evaluated via cross-validated R² and MAE.
9. SHAP Analysis — Feature importance is computed via SHAP for each model/dataset combination.
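The transformation step (step 2 above) could look roughly like the following sketch; the column names are illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder

df = pd.DataFrame({
    "sms_count": [0, 1, 2, 50, 200],       # heavily right-skewed sensor feature
    "gender":    ["f", "m", "f", "f", "m"],
})

ct = ColumnTransformer([
    # Yeo-Johnson handles zeros/negatives; Box-Cox would require strictly positive data
    ("skewed", PowerTransformer(method="yeo-johnson"), ["sms_count"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])
X = ct.fit_transform(df)   # 1 transformed column + 2 one-hot columns
```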
- Raw wide features outperform slope/intercept features alone
- Adding baseline demographic features improves performance similarly regardless of whether raw or slope/intercept features are used
- Top predictive features (from SHAP) include screen_4, device, mhs_3, phq9_9_base, screen_3, and heard_about_us
Notes/FYI
- The four dataset variants (v1_day, v2_day, v1_week, v2_week) are processed in parallel throughout — most loops iterate over all four
- Several cells appear incomplete or are placeholders — the notebook is exploratory in nature and not all cells are intended to be run sequentially
- Intermediate CSVs are written to disk between steps, so individual sections can be re-run independently
BRIGHTEN Smartphone Sensor Pipeline — Part 3: Subject-Level Feature Networks
02_subject-networks.ipynb is the third notebook in the BRIGHTEN smartphone sensor modeling pipeline, following 01_cleaning.ipynb (raw data ingestion, merging, and participant-level train/test splitting) and 02_Pipeline.ipynb (distribution analysis, feature transformation, imputation, wide reshaping, slope/intercept feature engineering, HistGradientBoosting modeling, and SHAP analysis).
Where the prior notebooks treat the feature space statistically — via correlation filtering, PCA, and aggregate SHAP importance — this notebook shifts to a subject-level, network-based representation of feature relationships. The core idea is to examine how sensor-derived principal components co-vary within individual participants over time, visualizing intra-subject feature interaction structure using signed correlation networks.
Purpose
02_Pipeline reduced raw longitudinal sensor features into principal components (PCs) capturing latent structure across behavioral modalities (sleep, mobility, social interaction, mood self-report, etc.). A natural follow-on question is: do the relationships among these PCs differ meaningfully across individuals? Depressed individuals may show qualitatively different coupling between, say, mobility and mood self-report than non-depressed individuals, even if their mean feature values look similar. Within-person averaged networks operationalize this intuition, enabling idiographic analysis.
1. Setup & Imports
Standard scientific Python stack plus a custom feature_selection module (aliased fs) from the project's scripts/ directory. The fs.plotnetwork function is the primary workhorse.
2. Load PC-Transformed Data
Reads PC-scored longitudinal DataFrames produced by 02_Pipeline — one per dataset variant (v1_day, v2_day, v1_week, v2_week). Each row is a participant-day or participant-week observation; columns are extracted PC components (e.g., pc_sleep_mood_mhs, pc_phq2, pc_calls, pc_mobility, pc_mobility-radius, pc_missed-interactions, pc_texts, pc_lieAwake, pc_morning-interac).
3. Per-Subject Network Construction
For each participant, a signed correlation (or partial correlation) network is computed over their longitudinal PC time series. Each PC component becomes a node; edge weights = pairwise temporal co-variation for that individual. This is computed in a loop over subject IDs, with a filter for participants with sufficient longitudinal coverage to produce stable estimates (consistent with the ≥4-week minimum used in 02_Pipeline).
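A minimal sketch of the per-subject network construction, using toy PC columns and a roughly 4-week coverage filter (the repo's fs.plotnetwork handles the plotting):

```python
import numpy as np
import pandas as pd

# Toy PC-scored long-format data: one row per participant-day
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "participant_id": np.repeat([1, 2], 30),
    "pc_mobility": rng.normal(size=60),
    "pc_phq2": rng.normal(size=60),
    "pc_texts": rng.normal(size=60),
})
pc_cols = ["pc_mobility", "pc_phq2", "pc_texts"]

networks = {}
for sub, sub_df in df.groupby("participant_id"):
    if len(sub_df) < 28:          # require ~4 weeks of coverage for stable estimates
        continue
    # Signed Pearson correlation matrix = adjacency matrix of this subject's network
    networks[sub] = sub_df[pc_cols].corr()
```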
4. Network Visualization via fs.plotnetwork
Each subject's network is rendered by fs.plotnetwork with signed edge coloring — green edges indicate positive temporal co-variation, red edges indicate negative co-variation. Plot titles encode dataset variant and participant ID (e.g., v1_day: sub 2074.0).
5. (Downstream) Network-Level Feature Extraction
Subject networks can be summarized into graph-theoretic descriptors — edge density, signed modularity, hub nodes, mean edge weight — for use as additional features in the depression prediction models from 02_Pipeline, or as a standalone analysis of individual differences in behavioral feature coupling as a function of depression trajectory.
| File | Description |
|---|---|
| {name}_transformed.csv (or equivalent PC-scored output) | Long-format participant × time matrices with PC components as columns, produced by 02_Pipeline.ipynb |
Dataset variants: v1_day, v2_day, v1_week, v2_week
The PC components visualized here are the same latent dimensions identified as predictively relevant in 02_Pipeline (e.g., pc_sleep_mood_mhs, pc_phq2, pc_mobility). This notebook provides an individualized, temporal-relational lens on features that the modeling notebook treats in aggregate.
Notes
- Visualizations are cleared from the committed notebook; re-run all cells to regenerate plots.
- Network estimation is per-subject — participants with sparse longitudinal coverage will produce unstable estimates. Apply the same ≥4-week minimum used upstream.
- V1 and V2 have different PC compositions due to different sensor modalities (V2 includes GPS/mobility features absent from V1); networks should not be compared directly across versions.
- The notebook is exploratory in nature, consistent with the conventions of the broader pipeline.
Example Within-subject PC correlation networks
This is the fourth notebook in the BRIGHTEN pipeline, sitting after 01_cleaning → 02_Pipeline → 02_subject-networks. Where prior notebooks focus on sensor features and temporal modeling, this one turns to static baseline characteristics (demographics and symptoms) and investigates whether clinically or demographically meaningful subgroups exist in the study sample. Prior notebooks work with longitudinal sensor data, but this notebook works with baseline-only snapshots: one row per participant.
The goal is to discover whether the BRIGHTEN sample contains meaningful clinical subgroups — e.g., a high-anxiety, low-depression cluster vs. a psychosis-risk cluster vs. a mild-symptoms cluster — that might moderate the relationship between sensor features and depression outcomes found in 02_Pipeline. Cluster labels could feed back as stratification variables for the predictive models, or serve as a standalone characterization of who participates in digital mental health research.
Rather than reading the longitudinal _trainval.csv files, this notebook reads a specific set of baseline CSVs directly from BRIGHTEN_data/:
- PHQ-9 - Baseline.csv — baseline depression severity (9-item scale)
- Baseline Demographics.csv — gender, race, education, marital status, etc.
- IMPACT Mania and Psychosis Screening.csv — screening for bipolar/psychosis symptoms
- Alcohol.csv — alcohol use
- GAD - Anxiety.csv — baseline anxiety severity (7-item scale)
- Mental Health Services.csv — current service utilization
- id_key.csv — maps participant_id to anonymized num_id
- Cluster variation for each demographic variable
- Cluster profiles
1. Setup & Imports
Loads the standard scientific Python stack plus three custom project scripts: preprocessing, visualization, and clustering (aliased cl). The clustering module is unique to this notebook and wraps KMeans, agglomerative clustering, silhouette scoring, and dendrogram utilities, which are also imported directly from sklearn/scipy.
2. Data Loading — Baseline Clinical & Demographic Tables
3. Cleaning & Feature Construction
Each loaded DataFrame is passed through a consistent cleaning loop that:
- Merges in the id_key and drops participant_id to preserve anonymity
- Drops irrelevant or redundant columns (dt, study, cohort, device, heard_about_us, etc.)
- Computes composite sum scores where applicable:
  - phq9_sum_base — sum of all 9 PHQ-9 baseline items
  - mhs_sum — sum of mental health service use items
  - bipolar — sum of screen_2 + screen_3 (bipolar screening)
  - scz — sum of screen_1 + screen_4 (psychosis screening)
4. Merging into a Single Baseline DataFrame
The notebook reads alldays_df.csv (produced by 01_cleaning) to get the full participant ID list, then merges the six baseline tables on num_id to build one wide participant-level matrix. This mirrors the logic in 01_cleaning but scoped to baseline-only features.
5. Clustering Analysis
Using this baseline feature matrix, the notebook applies:
- KMeans clustering — partitions participants into k groups based on Euclidean distance in the baseline feature space; k is likely chosen via silhouette score evaluation
- Agglomerative (hierarchical) clustering — builds a dendrogram over pairwise distances (pdist, squareform, linkage), enabling visualization of nested cluster structure
- Silhouette scoring — quantifies cluster quality; used to select optimal k for KMeans
- Dendrogram visualization — renders the hierarchical merge tree to identify natural groupings
Categorical demographics are encoded via LabelEncoder before clustering (consistent with 01_cleaning).
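The silhouette-guided choice of k can be sketched as follows, using synthetic blobs in place of the baseline feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy baseline feature matrix with 3 well-separated subgroups
X, _ = make_blobs(n_samples=120, centers=3, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # higher = tighter, better-separated clusters

best_k = max(scores, key=scores.get)
```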
This is the fifth notebook in the BRIGHTEN pipeline, following 01_cleaning → 02_Pipeline → 02_subject-networks → 03_demographic_clustering.
The primary goal is to predict PHQ-2 (daily depression screening) and PHQ-9 (weekly depression severity) scores from passive smartphone sensor features, self-report surveys, and baseline demographic/clinical data. The notebook explores this through several progressively refined modeling strategies: cross-subject population models, per-subject idiographic models, feature selection via ANOVA and correlation, SHAP-based feature importance, and slope/intercept temporal feature engineering.
CSV files from BRIGHTEN_data/, produced by 02_Pipeline.ipynb:
| File | Description |
|---|---|
| {name}_trainval_transformed.csv | Preprocessed, transformed long-format data for train/validation |
| {name}_test_transformed.csv | Held-out test set |
| {name}_trainval_nonskew.csv | Non-skew-transformed version, used for ANOVA feature selection |
| {name}_wide_slopeintercept_2wk_outcomes.csv | Wide-format slope/intercept features + 6-week outcome targets |
Dataset variants: v1_day, v2_day, v1_week, v2_week
| Model | Use Case |
|---|---|
| HistGradientBoostingRegressor | Primary regression model for PHQ-2/PHQ-9 scores |
| HistGradientBoostingClassifier | Classification variant for binary/categorical depression outcomes |
| GroupMeanRegressor (custom) | Dummy baseline: predicts each participant's own mean score |
| RandomForestRegressor, XGBRegressor, Ridge | Additional regressors defined but used selectively |
All regression models are evaluated via R² and MAE; cross-validation uses GroupKFold (n=5) to ensure participants are not split across train and test folds.
1. Setup & Imports
2. ANOVA Feature Selection (All Subjects)
For each subject individually, ANOVA F-scores (SelectKBest, f_classif) are computed between all sensor features and the target variable (PHQ-2 or PHQ-9). The top 10 features per subject are saved, and the most frequently top-ranked features across subjects are aggregated — yielding a population-level ranking of which features are most consistently informative for an individual subject. Results are visualized as frequency bar charts and saved to anova_features.
3. Population Model — All Features, PHQ-2
A HistGradientBoostingRegressor is trained on all available features (excluding PHQ columns) using a GroupShuffleSplit 80/20 participant-level split. Models are trained with max_iter ∈ {50, 100, 200} and evaluated by comparing predicted vs. actual average PHQ-2 over time. Permutation feature importance is computed on the test set for the best model and saved as a CSV.
4. Population Model — Correlation-Filtered Features, PHQ-2
Repeats the above but first filters to only features with |r| > 0.1 with the PHQ-2 target, reducing dimensionality before modeling. Permutation importance is again saved.
5. Per-Subject Model — Top Correlated Features
For each participant individually (filtered to ≥20 observations), the top 10 features most correlated with their own PHQ-2 scores are selected and used to fit a subject-specific HistGradientBoostingRegressor. The last 20% of each subject's days are held out as a test set (temporal split, not random), and predicted vs. actual PHQ-2 is plotted per subject.
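The temporal 80/20 per-subject split can be sketched as:

```python
import numpy as np
import pandas as pd

# One subject's days, already sorted by date
sub_df = pd.DataFrame({"day": pd.date_range("2016-01-01", periods=50),
                       "phq2_sum": np.arange(50)})

if len(sub_df) >= 20:                    # coverage filter from the text
    cut = int(len(sub_df) * 0.8)         # first 80% of days train, last 20% test
    train, test = sub_df.iloc[:cut], sub_df.iloc[cut:]
```

Unlike a random split, this keeps the test days strictly after the training days, mimicking prospective prediction.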
6. Per-Subject Model — ANOVA Top 15 Features, PHQ-2
Same per-subject approach, but feature selection uses the population-level ANOVA top features (from Step 2) rather than each subject's own correlations. R² from the model is compared against a naive baseline (predicting the training-period mean) for each subject, and the improvement is saved as {name}_{y_col}_r2_model_increase_from_avg.csv.
7. Per-Subject Model — Subject-Specific ANOVA Top 10, PHQ-2
Repeats Step 6 but uses each subject's own top ANOVA features (from top_features[y_col][name][sub]) rather than the population aggregate. Both approaches are compared to quantify whether personalized feature selection adds value over population-level selection.
8. Population Model — PHQ-9 from Weekly Data
Switches target to phq9_sum using v1_week and v2_week. Uses the same ANOVA top-15 feature set and temporal per-subject splits. R² vs. baseline is tracked and saved per dataset variant.
9. HistGradientBoosting with Full Feature Sets — PHQ-9
Trains HistGradientBoostingRegressor on v1_week/v2_week using GroupKFold cross-validation across three feature configurations:
- baseline: demographic and clinical baseline features only
- 8wks: passive sensor features collected during the 8-week study (no baseline)
- both: all features combined
Reports mean R² and MAE per configuration. Also predicts on a held-out validation set. Results and trained models are stored in model_dict for downstream SHAP analysis.
10. SHAP Analysis — PHQ-9 Models
For each model/dataset/feature-set combination, SHAP values are computed using shap.Explainer with the first CV fold's model and test set. Bar plots of mean absolute SHAP values are displayed, and memory is explicitly freed between iterations with gc.collect().
11. Per-Subject Prediction Plots — PHQ-9
Predicted vs. actual PHQ-9 scores are plotted per subject using Plotly line charts, with per-subject R² and Pearson correlation displayed in the title. Results are aggregated into subject_phq9_preds_from_diff_models.csv.
12. Same Pipeline — Excluding All PHQ/Survey Variables (PHQ-9)
Repeats the full GroupKFold + SHAP + per-subject plotting pipeline for PHQ-9, but drops PHQ-2, SDS, stress, support, mood (MHS) columns — testing whether a model built on purely passive sensor data (no self-report survey items) retains predictive power. MHS items are summed into mhs_sum before dropping the individual items.
13. PHQ-2 Sum Prediction — Daily Data
Mirrors the PHQ-9 pipeline (steps 9–12) but targets phq2_sum using v1_day/v2_day, drops all PHQ-9 and survey-derived variables, and saves per-subject scores to subject_phq2_preds_from_diff_models.csv.
14. Subject Score Visualizations
Reads saved subject-level score CSVs and generates Plotly bar charts of R² and Pearson correlation per subject, colored by dataset variant, for both 8wks and both feature configurations.
15. Slope/Intercept Feature Modeling — Predicting 6-Week Outcomes
Loads the wide-format slope/intercept DataFrames (engineered in 02_Pipeline), and trains HistGradientBoostingRegressor/Classifier models to predict three 6-week outcomes: phq9_sum_6wks, 6wks_depressed_binary, and depression_change_bin. Feature sets are restricted to the first 2 weekly blocks (weeks 1–2 of data) to simulate early prediction. Results are stored in model_dict_slopeint.
16. SHAP for Slope/Intercept Models
SHAP analysis on the slope/intercept models, using shap.TreeExplainer for regressors and predict_proba-based explainers for classifiers. Top 20 features by mean absolute SHAP value are stored in top15features for use in the next step.
17. Reduced Feature Modeling — Top 15 SHAP Features
Re-trains slope/intercept models using only the top 15 SHAP-identified features per target/dataset combination, evaluating whether feature reduction (from 150–300 features down to 15) preserves predictive performance.
| File | Description |
|---|---|
| {name}_feature_importances_maxIter{n}.csv | Permutation feature importances from population HGBT models |
| {name}_{y_col}_r2_model_increase_from_avg.csv | Per-subject R² improvement over naive mean baseline (ANOVA features) |
| {name}_{y_col}_sub-specific_r2_model_increase_from_avg.csv | Same, using subject-specific ANOVA feature selection |
| subject_phq9_preds_from_diff_models.csv | Per-subject R² and Pearson r for PHQ-9 models |
| subject_phq2_preds_from_diff_models.csv | Per-subject R² and Pearson r for PHQ-2 models |
- A temporal train/test split (first 80% of days → train, last 20% → test) is used for per-subject models, rather than random splitting, to simulate real-world prospective prediction.
- A custom GroupMeanRegressor serves as the within-person null: it predicts each subject's own training-period mean, which is a much stricter baseline than a global mean for longitudinal mental health data.
- Models are evaluated across three distinct feature regimes (baseline-only, sensor-only, combined) to isolate the marginal contribution of passive sensing beyond what is explained by clinical demographics.
- The notebook is exploratory and modular — sections can be run independently given the right intermediate CSVs on disk. Not all cells are designed for sequential execution.
- V1 and V2 data are processed in parallel throughout; results should not be pooled across versions due to different sensor modalities.




