This Python script implements a process resource monitoring and anomaly detection system built on K-Means clustering, PCA, and Isolation Forest. It groups system processes by their resource usage patterns and flags processes that deviate from those patterns as potential anomalies.
- Resource Monitoring: Tracks multiple system metrics including:
  - CPU usage percentage
  - Memory usage percentage
  - Disk read/write rates (MB)
  - Network sent/received rates (MB)
- Advanced Analytics:
  - K-Means clustering for behavioral profiling
  - Principal Component Analysis (PCA) for dimensionality reduction
  - Hybrid anomaly detection using:
    - Isolation Forest
    - Distance-based metrics
  - Interactive 3D visualization of process clusters
- Comprehensive Reporting:
  - Detailed cluster analysis
  - Anomaly detection results
  - Performance metrics
  - Interactive visualizations
# Core Libraries
numpy
pandas
joblib
# Visualization
matplotlib
plotly
plotly.express
plotly.graph_objects
# Machine Learning
scikit-learn

- Clone this repository or download the script
- Install required packages:
  pip install numpy pandas joblib matplotlib plotly scikit-learn
- Prepare your process log data in CSV format with the following columns:
  - cpu_percent
  - memory_percent
  - disk_read_mb
  - disk_write_mb
  - net_sent_mb
  - net_recv_mb
- Update the configuration section in the script:
  DATA_PATH = 'process_log.csv' # Path to your input data
  SAVE_DIR = 'models' # Directory to save models
- Run the script:
  python process_resource_clustering_k_means.py

The script is organized into multiple cells, each handling a specific aspect of the analysis:
- Setup & Configuration (Cell 1)
  - Library imports
  - Global configuration
  - Utility functions
- Data Loading & Inspection (Cell 2)
  - Data loading
  - Initial validation
  - Schema verification
- Exploratory Data Analysis (Cells 3-5)
  - Feature distributions
  - Correlation analysis
  - Temporal behavior analysis
- Data Preprocessing (Cells 6-7)
  - Missing value handling
  - Scaling
  - PCA transformation
- Model Development (Cells 8-10)
  - K-Means model selection
  - Model training
  - Performance evaluation
- Anomaly Detection (Cell 11)
  - Hybrid approach implementation
  - Distance-based detection
  - Isolation Forest integration
- Visualization (Cells 12-14)
  - Interactive 3D cluster visualization
  - Inference overlay
  - Test case visualization
- Reporting (Cells 15-17)
  - Data export
  - Summary metrics
  - Interpretive analysis
  - Anomaly intelligence report
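
The hybrid anomaly detection step listed above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `hybrid_anomaly_scores`, the equal weighting of the two signals, and the default cluster and component counts are all assumptions made for the example. It min-max scales each point's distance to its nearest K-Means centroid, does the same for the negated Isolation Forest score, and blends the two.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def hybrid_anomaly_scores(X, n_clusters=3, n_components=3, weight=0.5, random_state=42):
    """Return (cluster_labels, scores); a higher score means more anomalous.

    Hypothetical helper: `weight` is the share given to the distance-based
    score, the remainder goes to the Isolation Forest score.
    """
    # Standardize, then reduce dimensionality, mirroring the preprocessing cells.
    X_scaled = StandardScaler().fit_transform(X)
    X_pca = PCA(n_components=n_components, random_state=random_state).fit_transform(X_scaled)

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_pca)
    # transform() returns the distance to every centroid; the minimum is the
    # distance to the centroid of the assigned cluster.
    dist = kmeans.transform(X_pca).min(axis=1)
    dist_score = MinMaxScaler().fit_transform(dist.reshape(-1, 1)).ravel()

    iso = IsolationForest(random_state=random_state).fit(X_pca)
    # score_samples() is lower for more abnormal points, so negate it.
    iso_raw = -iso.score_samples(X_pca)
    iso_score = MinMaxScaler().fit_transform(iso_raw.reshape(-1, 1)).ravel()

    combined = weight * dist_score + (1.0 - weight) * iso_score
    return kmeans.labels_, combined


# Synthetic stand-in for the six resource columns (not real process data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
labels, scores = hybrid_anomaly_scores(X_demo)
top5 = np.argsort(scores)[-5:]  # indices of the five highest-scoring rows
```

One reason to combine the two signals: an extreme point can end up as its own K-Means cluster and sit at distance zero from its centroid, while Isolation Forest can still score it as unusual.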
The script generates several outputs:
- Trained models saved in the models directory:
  - scaler.joblib: StandardScaler model
  - pca.joblib: PCA model
  - kmeans_final.joblib: Final K-Means model
  - isolation_forest.joblib: Isolation Forest model
  - dist_minmax.joblib: MinMax scaler for distances
- Comprehensive CSV report:
  - cluster_anomaly_summary.csv: Detailed analysis results
- Interactive visualizations:
  - 3D cluster plots
  - Feature distribution plots
  - Correlation heatmaps
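
As a rough sketch of how the saved models might be reused for inference, the example below writes three of the files with joblib and reloads them to assign clusters to new rows. The file names come from the list above; the helpers `save_models` and `predict_clusters`, the synthetic data, and the use of a temporary directory instead of the models directory are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def save_models(save_dir, scaler, pca, kmeans):
    """Persist the fitted preprocessing and clustering models (hypothetical helper)."""
    os.makedirs(save_dir, exist_ok=True)
    joblib.dump(scaler, os.path.join(save_dir, "scaler.joblib"))
    joblib.dump(pca, os.path.join(save_dir, "pca.joblib"))
    joblib.dump(kmeans, os.path.join(save_dir, "kmeans_final.joblib"))


def predict_clusters(save_dir, X_new):
    """Reload the models and assign each new row to a cluster (hypothetical helper)."""
    scaler = joblib.load(os.path.join(save_dir, "scaler.joblib"))
    pca = joblib.load(os.path.join(save_dir, "pca.joblib"))
    kmeans = joblib.load(os.path.join(save_dir, "kmeans_final.joblib"))
    # New data must pass through the same transforms, in the same order, as training data.
    return kmeans.predict(pca.transform(scaler.transform(X_new)))


# Synthetic stand-in for the six resource columns (not real process data).
rng = np.random.default_rng(0)
X = rng.random((100, 6))

scaler = StandardScaler().fit(X)
pca = PCA(n_components=3).fit(scaler.transform(X))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    pca.transform(scaler.transform(X))
)

# A temporary directory keeps the example from overwriting real model files.
with tempfile.TemporaryDirectory() as tmp_dir:
    save_models(tmp_dir, scaler, pca, kmeans)
    reloaded_labels = predict_clusters(tmp_dir, X)
```

The Isolation Forest and distance scaler files could be handled the same way; they are left out here to keep the sketch short.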
The script evaluates clustering performance using multiple metrics:
- Silhouette Score
- Calinski-Harabasz Index
- Davies-Bouldin Index
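
For illustration, the three metrics can be computed with scikit-learn for several candidate cluster counts, as in the sketch below. The helper `evaluate_k` and the synthetic three-blob data are assumptions for the example, not the script's own code. Higher Silhouette and Calinski-Harabasz values indicate better-separated clusters, while a lower Davies-Bouldin value is better.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def evaluate_k(X, k_values, random_state=42):
    """Fit K-Means for each k and collect the three metrics (hypothetical helper)."""
    results = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        results.append(
            {
                "k": k,
                "silhouette": silhouette_score(X, labels),
                "calinski_harabasz": calinski_harabasz_score(X, labels),
                "davies_bouldin": davies_bouldin_score(X, labels),
            }
        )
    return results


# Three well-separated synthetic blobs, so the preferred k is known in advance.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

results = evaluate_k(X, range(2, 6))
best = max(results, key=lambda r: r["silhouette"])  # pick k by Silhouette Score
```

On real process logs the three metrics will not always agree on a single k, so they are best read together rather than used as a single hard rule.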
This project is open-source and available under the MIT License.
Author: Swarag V S
Contributions are welcome! Please feel free to submit a Pull Request.











