This notebook implements a complete machine learning pipeline for flood risk prediction using geospatial raster features. The analysis is structured into six major parts.
The dataset (`Training_Data_16k.csv`, ~16k samples) is loaded and explored. Features are categorized into continuous (Elevation, Slope, Rainfall, TWI, SPI, etc.), categorical (LULC, Geomorphology, Lithology, Soil), and one circular feature (Aspect). The binary target variable represents flood risk (0 = no flood, 1 = flood).
Invalid/special values (-128, -1, NA, 65535) are replaced with NaN and rows with missing values are removed. Features are then processed based on their type:
- Continuous → Z-score normalization (`StandardScaler`)
- Categorical → One-hot encoding (with `drop_first` to avoid the dummy variable trap)
- Circular (Aspect) → Sin/Cos transformation to preserve angular continuity
All processed feature groups are concatenated into a final dataset and saved to CSV files, along with visualizations of distributions before and after transformation.
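The cleaning and per-type transformations above can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual code; the column names and sentinel values mirror those described in the text.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for Training_Data_16k.csv (rows are illustrative)
df = pd.DataFrame({
    "Elevation": [120.0, -128, 340.5, 88.0],
    "Aspect":    [0.0, 90.0, 180.0, 270.0],   # degrees
    "LULC":      [1, 2, 65535, 2],
})

# Replace invalid/sentinel values with NaN, then drop incomplete rows
df = df.replace([-128, -1, 65535], np.nan).dropna()

# Continuous → z-score normalization
df["Elevation_z"] = StandardScaler().fit_transform(df[["Elevation"]]).ravel()

# Circular (Aspect, degrees) → sin/cos, so 359° and 1° end up close together
rad = np.deg2rad(df["Aspect"])
df["Aspect_sin"], df["Aspect_cos"] = np.sin(rad), np.cos(rad)

# Categorical → one-hot with drop_first to avoid the dummy variable trap
df["LULC"] = df["LULC"].astype(int)
df = pd.get_dummies(df, columns=["LULC"], prefix="LULC", drop_first=True)
```

The sin/cos pair is what preserves angular continuity: a single 0–360 column would place north-facing slopes (1° and 359°) at opposite ends of the scale.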
Three correlation methods are applied:
- Pearson correlation between continuous features (flagging pairs with |r| > 0.7)
- Point-Biserial correlation between all features and the binary target — top 20 most correlated features visualized
- Chi-Square test + Cramér's V for categorical features vs. the target, with statistical significance testing
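The latter two tests can be sketched with `scipy.stats` on synthetic data (the variables `elev` and `lulc` are hypothetical stand-ins for the real features); Cramér's V is derived from the chi-square statistic as sqrt(chi2 / (n * (k - 1))):

```python
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr, chi2_contingency

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)                    # binary flood target (assumed)
elev = y * 1.5 + rng.normal(size=200)          # continuous feature tied to y
lulc = rng.integers(0, 4, 200)                 # unrelated categorical feature

# Point-biserial correlation: continuous feature vs. binary target
r_pb, p_pb = pointbiserialr(y, elev)

# Chi-square test + Cramér's V: categorical feature vs. target
ct = pd.crosstab(lulc, y)
chi2, p_chi, dof, _ = chi2_contingency(ct)
n = ct.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(ct.shape) - 1)))
```

Cramér's V lands in [0, 1] regardless of table size, which is what makes it comparable across categorical features with different numbers of levels.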
Collinearity analysis identifies highly correlated feature pairs (|r| > 0.8) and computes VIF (Variance Inflation Factor) for all features, categorizing them as no / moderate / high / severe multicollinearity.
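A VIF computation can be written directly from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on all the others. The sketch below uses a plain least-squares fit; the near-collinear `TWI` column is synthetic, chosen only to make the "severe" category show up:

```python
import numpy as np
import pandas as pd

def vif_table(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing feature j on all other features."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy()
        X = df.drop(columns=col).to_numpy()
        X = np.column_stack([np.ones(len(X)), X])      # add intercept
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r2 = 1.0 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

rng = np.random.default_rng(1)
demo = pd.DataFrame({"Slope": rng.normal(size=300),
                     "Rainfall": rng.normal(size=300)})
demo["TWI"] = 0.9 * demo["Slope"] + rng.normal(scale=0.1, size=300)

vif = vif_table(demo)
# Common rule of thumb: ~1 none, 1-5 moderate, 5-10 high, >10 severe
```

The no / moderate / high / severe buckets in the notebook correspond to thresholds along these conventional lines.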
Feature selection is performed using four independent methods:
- Mutual Information — measures dependency between each feature and the target
- Random Forest Feature Importance — built-in importance from a trained RF model (100 trees, max depth 10)
- Permutation Importance — measures accuracy drop when each feature is randomly shuffled
- ANOVA F-Statistic — tests statistical difference in feature means between classes
All four methods are normalized and averaged into a Combined Ranking, producing recommended feature sets of 10, 15, and 20 top features. A consensus analysis identifies features that appear in the top 10 across multiple methods.
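The four rankings and their combination can be sketched as below. The dataset and feature names are synthetic placeholders; the min-max normalization and averaging mirror the Combined Ranking described above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the processed flood dataset
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
cols = [f"f{i}" for i in range(6)]              # hypothetical feature names

rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            random_state=0).fit(X, y)

scores = pd.DataFrame({
    "mutual_info": mutual_info_classif(X, y, random_state=0),
    "rf_importance": rf.feature_importances_,
    "permutation": permutation_importance(rf, X, y, n_repeats=5,
                                          random_state=0).importances_mean,
    "anova_f": f_classif(X, y)[0],
}, index=cols)

# Min-max normalize each method to [0, 1], then average into a combined rank
norm = (scores - scores.min()) / (scores.max() - scores.min())
combined = norm.mean(axis=1).sort_values(ascending=False)
top_features = combined.head(10).index.tolist()  # only 6 exist in this demo
```

Normalizing before averaging matters because the raw scales differ wildly (F-statistics can be in the hundreds while mutual information is typically below 1).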
Twelve different feature sets are benchmarked (Top 10/20 from each individual method + combined sets + all features as baseline). For each set, a Random Forest classifier is trained and evaluated using:
- Accuracy, Precision, Recall, F1-Score
- 5-fold Cross-Validation
Results are compared to identify the best trade-off between the number of features and model performance.
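The benchmarking loop reduces to training one classifier per candidate feature set and cross-validating it. A minimal sketch, with hypothetical feature sets expressed as column-index lists:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)

# Hypothetical feature sets ("top 4" subset vs. all features as baseline)
feature_sets = {"top_4": list(range(4)), "all_12": list(range(12))}

results = {}
for name, idx in feature_sets.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    results[name] = cross_val_score(clf, X[:, idx], y, cv=5,
                                    scoring="f1").mean()
```

In the notebook the same loop runs over twelve sets and records accuracy, precision, and recall as well; `cross_val_score` would simply be called once per metric (or replaced with `cross_validate`).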
The best-performing feature set from Part 5 is used to fine-tune the Random Forest model in two stages:
- Randomized Search (50 combinations, 3-fold CV) — broad exploration of the hyperparameter space
- Grid Search (5-fold CV) — fine-grained search around the best parameters from step 1
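The two-stage scheme maps directly onto `RandomizedSearchCV` followed by `GridSearchCV`. The sketch below uses synthetic data, fewer random candidates than the notebook's 50 (to keep the demo fast), and only two hyperparameters; the grid in stage 2 is built around whatever stage 1 found:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Stage 1: broad randomized exploration of the hyperparameter space (3-fold CV)
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 200),
                         "max_depth": randint(3, 20)},
    n_iter=12, cv=3, random_state=0, n_jobs=-1,
).fit(X, y)

# Stage 2: fine-grained grid around the stage-1 winner, with 5-fold CV
best = rand.best_params_
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [max(10, best["n_estimators"] - 50),
                         best["n_estimators"], best["n_estimators"] + 50],
        "max_depth": [best["max_depth"] - 1, best["max_depth"],
                      best["max_depth"] + 1],
    },
    cv=5, n_jobs=-1,
).fit(X, y)
```

The randomized stage trades precision for coverage; the grid stage then spends its budget only in the neighborhood that already looks promising.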
The baseline, randomized search, and grid search models are compared across all metrics. A confusion matrix and full classification report are generated for the best model, and the final optimized parameters are saved.
The analysis produces a comprehensive set of saved files including processed datasets, feature ranking CSVs, accuracy assessment results, best model parameters, and 15+ visualization PNGs covering distributions, correlations, VIF, feature importances, and model comparisons.