Simplified dynamic profiling data model and heartbeat document #76
Open
ashokbytebytego wants to merge 1 commit into master from dynamic_profiling_data_model
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,345 @@ | ||
| # gProfiler Performance Studio - Heartbeat-Based Profiling Control | ||
|
|
||
| This document describes the heartbeat-based profiling control system that allows dynamic start/stop of profiling sessions through API commands. | ||
|
|
||
| ## Overview | ||
|
|
||
| The heartbeat system lets the backend control gProfiler agents remotely through a four-step mechanism: | |
|
|
||
| 1. **Agents send periodic heartbeats** to the Performance Studio backend | ||
| 2. **Backend responds with profiling commands** (start/stop) when available | ||
| 3. **Agents execute commands with built-in idempotency** to prevent duplicate execution | ||
| 4. **Commands are tracked and logged** for audit and debugging | ||
|
|
||
| ## Architecture | ||
|
|
||
| ``` | ||
| ┌─────────────────┐    heartbeat    ┌──────────────────────┐ | |
| │   gProfiler     │ ──────────────► │  Performance Studio  │ | |
| │     Agent       │                 │       Backend        │ | |
| │                 │ ◄────────────── │                      │ | |
| └─────────────────┘    commands     └──────────────────────┘ | |
|          │                                     │ | |
|          │                                     │ | |
|          ▼                                     ▼ | |
| ┌─────────────────┐                 ┌──────────────────────┐ | |
| │  Profile Data   │                 │    PostgreSQL DB     │ | |
| │  (S3/Local)     │                 │  - Host Heartbeats   │ | |
| └─────────────────┘                 │  - Profiling Cmds    │ | |
|                                     └──────────────────────┘ | |
| ``` | ||
|
|
||
| ## Database Schema | ||
|
|
||
| ### Core Tables | ||
|
|
||
| 1. **HostHeartbeats** - Tracks agent status and last-seen information | |
| 2. **ProfilingRequests** - Stores profiling requests from API calls | |
| 3. **ProfilingCommands** - Holds commands sent to agents (merged from multiple requests) | |
| 4. **ProfilingExecutions** - Records execution history for the audit trail | |
|
|
||
| ### Key Features | ||
| - **Simple DDL** with essential indexes only | ||
| - **No stored procedures** - all logic in application code | ||
| - **No triggers** - timestamps handled by application | ||
| - **Consistent naming** with `idx_` prefix for all indexes | ||
|
|
||
| ## API Endpoints | ||
|
|
||
| ### 1. Create Profiling Request | ||
| ```http | ||
| POST /api/metrics/profile_request | ||
| Content-Type: application/json | ||
|
|
||
| { | |
|   "service_name": "my-service", | |
|   "request_type": "start", | |
|   "duration": 60, | |
|   "frequency": 11, | |
|   "profiling_mode": "cpu", | |
|   "target_hosts": { | |
|     "host1": [1234, 5678], | |
|     "host2": null | |
|   }, | |
|   "stop_level": "process", | |
|   "additional_args": {} | |
| } | |
| ``` | ||
|
|
||
| **Response:** | ||
| ```json | ||
| { | |
|   "success": true, | |
|   "message": "Start profiling request submitted successfully", | |
|   "request_id": "uuid", | |
|   "command_id": "uuid", | |
|   "estimated_completion_time": "2025-01-15T10:30:00Z" | |
| } | |
| ``` | ||
|
|
||
| ### 2. Agent Heartbeat | ||
| ```http | ||
| POST /api/metrics/heartbeat | ||
| Content-Type: application/json | ||
|
|
||
| { | |
|   "hostname": "host1", | |
|   "ip_address": "10.0.1.100", | |
|   "service_name": "my-service", | |
|   "last_command_id": "previous-command-uuid", | |
|   "status": "active" | |
| } | |
| ``` | ||
|
|
||
| **Response (with command):** | ||
| ```json | ||
| { | |
|   "success": true, | |
|   "message": "Heartbeat received. New profiling command available.", | |
|   "command_id": "new-command-uuid", | |
|   "profiling_command": { | |
|     "command_type": "start", | |
|     "combined_config": { | |
|       "duration": 60, | |
|       "frequency": 11, | |
|       "profiling_mode": "cpu", | |
|       "pids": "1234,5678" | |
|     } | |
|   } | |
| } | |
| ``` | ||
|
|
||
| ### 3. Command Completion | ||
| ```http | ||
| POST /api/metrics/command_completion | ||
| Content-Type: application/json | ||
|
|
||
| { | |
|   "command_id": "command-uuid", | |
|   "hostname": "host1", | |
|   "status": "completed", | |
|   "execution_time": 65, | |
|   "results_path": "/path/to/results" | |
| } | |
| ``` | ||
|
|
||
| ## Agent Integration | ||
|
|
||
| ### Heartbeat Configuration | ||
| ```bash | ||
| python3 gprofiler/main.py \ | ||
| --enable-heartbeat-server \ | ||
| --api-server "https://perf-studio.example.com" \ | ||
| --heartbeat-interval 30 \ | ||
| --service-name "my-service" \ | ||
| --token "api-token" | ||
| ``` | ||
|
|
||
| ### Heartbeat Flow | ||
| 1. **Agent sends heartbeat** every 30 seconds (configurable) | ||
| 2. **Backend checks for pending commands** for this hostname/service | ||
| 3. **If command available**, backend responds with command details | ||
| 4. **Agent executes command** and reports completion | ||
| 5. **Idempotency ensured** by tracking `last_command_id` | ||
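The flow above can be sketched from the agent's side as follows. This is a minimal illustration under stated assumptions, not the agent's actual code: `HeartbeatState`, `build_heartbeat`, and `handle_response` are hypothetical names, and only the JSON field names are taken from the API examples in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class HeartbeatState:
    """Hypothetical in-memory state an agent keeps between heartbeats."""
    hostname: str
    ip_address: str
    service_name: str
    last_command_id: Optional[str] = None
    executed: list = field(default_factory=list)


def build_heartbeat(state: HeartbeatState) -> dict:
    """Build the heartbeat body; field names follow the API examples."""
    payload = {
        "hostname": state.hostname,
        "ip_address": state.ip_address,
        "service_name": state.service_name,
        "status": "active",
    }
    if state.last_command_id is not None:
        payload["last_command_id"] = state.last_command_id
    return payload


def handle_response(state: HeartbeatState, response: dict,
                    execute: Callable[[dict], None]) -> bool:
    """Run the command in `response` at most once; return True if it ran."""
    command_id = response.get("command_id")
    command = response.get("profiling_command")
    if not command_id or not command:
        return False  # plain heartbeat acknowledgement, nothing to do
    if command_id == state.last_command_id:
        return False  # already executed: skip the duplicate delivery
    execute(command)
    # Record the ID only after execution so a failed run can be retried.
    state.last_command_id = command_id
    state.executed.append(command_id)
    return True
```

In this sketch `last_command_id` lives only in memory, so an agent restart would reset it. Persisting the value to disk is one possible way to keep the duplicate check working across restarts.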
|
|
||
| ## Command Types | ||
|
|
||
| ### START Commands | ||
| - Create new profiling sessions | ||
| - Merge multiple requests for the same host | |
| - Include combined configuration (duration, frequency, PIDs) | ||
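One way the merge into a combined configuration might look is sketched below. The merge policy is an assumption, since the document does not specify it: this version keeps the longest duration and the highest frequency, takes the union of PIDs, and treats a `null` PID list as "profile the whole host". The function name `merge_start_requests` and the per-request `pids` key are hypothetical.

```python
from typing import Iterable


def merge_start_requests(requests: Iterable[dict]) -> dict:
    """Combine several start requests for one host into one config.

    Assumed policy (not confirmed by the source): keep the longest
    duration and highest frequency, and take the union of PIDs. A request
    whose `pids` is None targets the whole host, so the merged config
    then omits `pids` entirely.
    """
    requests = list(requests)
    if not requests:
        raise ValueError("at least one request is required")

    modes = {r["profiling_mode"] for r in requests}
    if len(modes) != 1:
        raise ValueError(f"cannot merge different profiling modes: {sorted(modes)}")

    config = {
        "duration": max(r["duration"] for r in requests),
        "frequency": max(r["frequency"] for r in requests),
        "profiling_mode": modes.pop(),
    }
    whole_host = any(r.get("pids") is None for r in requests)
    if not whole_host:
        pids = sorted({pid for r in requests for pid in r["pids"]})
        # The heartbeat response example carries PIDs as a comma-separated string.
        config["pids"] = ",".join(str(pid) for pid in pids)
    return config
```

Rejecting mixed profiling modes is a conservative choice for the sketch; a real backend could instead issue one command per mode.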
|
|
||
| ### STOP Commands | ||
| - **Process-level**: Stop specific PIDs | ||
| - **Host-level**: Stop entire profiling session | ||
| - Process-level stops are automatically converted to host-level stops when only one PID remains | |
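The stop-level rule can be illustrated with a small sketch. It assumes one particular reading of the conversion: a process-level stop that would leave no profiled PIDs on the host is promoted to a host-level stop. The function name `resolve_stop` and the level string `"host"` are hypothetical; only `"process"` appears in the request example.

```python
from typing import Iterable, Optional, Set, Tuple


def resolve_stop(active_pids: Set[int],
                 pids_to_stop: Optional[Iterable[int]]) -> Tuple[str, Set[int]]:
    """Return (stop_level, remaining_pids) for a stop request on one host.

    Assumed reading of the conversion rule: a process-level stop that
    leaves nothing running is promoted to a host-level stop.
    """
    if pids_to_stop is None:
        return "host", set()  # explicit host-level stop
    remaining = set(active_pids) - set(pids_to_stop)
    if not remaining:
        return "host", set()  # the last profiled PID is being stopped
    return "process", remaining
```

For example, stopping PID 1234 while 1234 and 5678 are profiled stays a process-level stop, whereas stopping 5678 afterwards ends the whole session.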
|
|
||
| ## Data Flow Example | ||
|
|
||
| ### 1. Create Profiling Request | ||
| ```bash | ||
| curl -X POST http://localhost:8000/api/metrics/profile_request \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{ | ||
| "service_name": "web-service", | ||
| "request_type": "start", | ||
| "duration": 120, | ||
| "target_hosts": {"web-01": null, "web-02": null}, | |
| "profiling_mode": "cpu" | ||
| }' | ||
| ``` | ||
|
|
||
| ### 2. Agent Heartbeat | ||
| ```bash | ||
| # Agent automatically sends: | ||
| curl -X POST http://localhost:8000/api/metrics/heartbeat \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{ | ||
| "hostname": "web-01", | ||
| "ip_address": "10.0.1.10", | ||
| "service_name": "web-service", | ||
| "status": "active" | ||
| }' | ||
| ``` | ||
|
|
||
| ### 3. Agent Receives Command | ||
| ```json | ||
| { | |
|   "success": true, | |
|   "command_id": "cmd-12345", | |
|   "profiling_command": { | |
|     "command_type": "start", | |
|     "combined_config": { | |
|       "duration": 120, | |
|       "frequency": 11, | |
|       "profiling_mode": "cpu" | |
|     } | |
|   } | |
| } | |
| ``` | ||
|
|
||
| ### 4. Agent Reports Completion | ||
| ```bash | ||
| curl -X POST http://localhost:8000/api/metrics/command_completion \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{ | ||
| "command_id": "cmd-12345", | ||
| "hostname": "web-01", | ||
| "status": "completed", | ||
| "execution_time": 122 | ||
| }' | ||
| ``` | ||
|
|
||
| ## Testing | ||
|
|
||
| ### 1. Test Heartbeat System | ||
| ```bash | ||
| cd gprofiler-performance-studio | ||
| python3 test_heartbeat_system.py | ||
| ``` | ||
|
|
||
| This script: | ||
| - Simulates agent heartbeat behavior | ||
| - Creates test profiling requests | ||
| - Verifies command delivery and idempotency | ||
| - Tests both start and stop commands | ||
|
|
||
| ### 2. Run Test Agent | ||
| ```bash | ||
| python3 run_heartbeat_agent.py | ||
| ``` | ||
|
|
||
| This script: | ||
| - Starts a real gProfiler agent in heartbeat mode | ||
| - Connects to the Performance Studio backend | ||
| - Receives and executes actual profiling commands | ||
|
|
||
| ## Configuration | ||
|
|
||
| ### Backend Configuration | ||
| ```yaml | ||
| # Backend settings | ||
| database: | |
|   host: localhost | |
|   port: 5432 | |
|   database: gprofiler | |
|
|
||
| heartbeat: | |
|   max_age_minutes: 10 # Consider hosts offline after 10 minutes | |
|   cleanup_interval: 300 # Clean up old records every 5 minutes | |
| ``` | ||
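As an illustration of how `max_age_minutes` could be applied, the sketch below classifies a host as online or offline from its last heartbeat timestamp. The function name `is_host_online` is hypothetical, and the strict comparison is chosen to mirror the `heartbeat_timestamp > NOW() - INTERVAL '10 minutes'` filter used in the monitoring query; the backend's real logic may differ.

```python
from datetime import datetime, timedelta, timezone


def is_host_online(last_heartbeat: datetime, now: datetime,
                   max_age_minutes: int = 10) -> bool:
    """True if the last heartbeat is newer than the allowed maximum age."""
    return now - last_heartbeat < timedelta(minutes=max_age_minutes)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_host_online(now - timedelta(minutes=3), now))   # recent heartbeat
    print(is_host_online(now - timedelta(minutes=30), now))  # stale heartbeat
```

With the default 30-second heartbeat interval, a 10-minute window tolerates roughly 20 missed heartbeats before a host is treated as offline.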
|
|
||
| ### Agent Configuration | ||
| ```bash | ||
| # Required parameters | |
| --enable-heartbeat-server # Enable heartbeat mode | |
| --api-server URL # Performance Studio backend URL | |
| --service-name NAME # Service identifier | |
|
|
||
| # Optional parameters | |
| --heartbeat-interval SECONDS # Heartbeat interval in seconds (default: 30) | |
| --token TOKEN # Authentication token | |
| --server-host URL # Profile upload server (can be same as api-server) | |
| --no-verify # Skip SSL verification (testing only) | |
| ``` | ||
|
|
||
| ## Monitoring and Debugging | ||
|
|
||
| ### Database Queries | ||
| ```sql | ||
| -- Check active hosts | ||
| SELECT hostname, service_name, status, heartbeat_timestamp | ||
| FROM HostHeartbeats | ||
| WHERE status = 'active' AND heartbeat_timestamp > NOW() - INTERVAL '10 minutes'; | ||
|
|
||
| -- Check pending commands | ||
| SELECT hostname, service_name, command_type, status, created_at | ||
| FROM ProfilingCommands | ||
| WHERE status = 'pending'; | ||
|
|
||
| -- Check command execution history | ||
| SELECT pe.hostname, pr.request_type, pe.status, pe.execution_time | ||
| FROM ProfilingExecutions pe | ||
| JOIN ProfilingRequests pr ON pe.profiling_request_id = pr.ID | ||
| ORDER BY pe.created_at DESC; | ||
| ``` | ||
|
|
||
| ### Log Monitoring | ||
| ```bash | ||
| # Backend logs | ||
| tail -f /var/log/gprofiler-studio/backend.log | grep -E "(heartbeat|command)" | ||
|
|
||
| # Agent logs | ||
| tail -f /tmp/gprofiler-heartbeat.log | grep -E "(heartbeat|command)" | ||
| ``` | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Common Issues | ||
|
|
||
| 1. **Agents not receiving commands** | ||
| - Check heartbeat connectivity to backend | ||
| - Verify `service_name` matches between request and agent | |
| - Check agent authentication (token) | ||
|
|
||
| 2. **Commands executing multiple times** | ||
| - Verify agent is tracking `last_command_id` correctly | ||
| - Check for agent restarts that reset command tracking | ||
|
|
||
| 3. **Commands not being created** | ||
| - Verify `target_hosts` includes the agent's hostname | |
| - Check database constraints and foreign key relationships | ||
|
|
||
| ### Debug Commands | ||
| ```bash | ||
| # Test backend connectivity | ||
| curl -v http://localhost:8000/api/metrics/heartbeat \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{"hostname":"test","ip_address":"127.0.0.1","service_name":"test","status":"active"}' | ||
|
|
||
| # Check database state | ||
| psql -d gprofiler -c "SELECT * FROM HostHeartbeats ORDER BY heartbeat_timestamp DESC LIMIT 5;" | ||
| ``` | ||
|
|
||
| ## Security Considerations | ||
|
|
||
| 1. **Authentication**: Use API tokens for agent authentication | ||
| 2. **Network**: Secure communication with HTTPS/TLS | ||
| 3. **Authorization**: Validate service permissions before creating commands | ||
| 4. **Rate Limiting**: Implement rate limits on heartbeat endpoints | ||
| 5. **Input Validation**: Sanitize all input parameters | ||
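For the input-validation point, a minimal sketch of checking a heartbeat payload before it reaches the database is shown below. The field names come from the heartbeat example in this document; the function name `validate_heartbeat` and the 255-character limit are assumptions, not taken from the schema.

```python
import ipaddress
from typing import List

REQUIRED_FIELDS = ("hostname", "ip_address", "service_name", "status")
MAX_FIELD_LENGTH = 255  # assumed limit, not taken from the schema


def validate_heartbeat(payload: dict) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for name in REQUIRED_FIELDS:
        value = payload.get(name)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{name}: required non-empty string")
        elif len(value) > MAX_FIELD_LENGTH:
            errors.append(f"{name}: longer than {MAX_FIELD_LENGTH} characters")

    ip = payload.get("ip_address")
    if isinstance(ip, str) and ip.strip():
        try:
            ipaddress.ip_address(ip)  # accepts IPv4 and IPv6 literals
        except ValueError:
            errors.append("ip_address: not a valid IPv4/IPv6 address")

    last_id = payload.get("last_command_id")
    if last_id is not None and not isinstance(last_id, str):
        errors.append("last_command_id: must be a string when present")
    return errors
```

Returning a list of errors instead of raising on the first problem lets the endpoint report every invalid field in a single response.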
|
|
||
| ## Performance Considerations | ||
|
|
||
| 1. **Database Indexes**: Essential indexes are created for all lookup patterns | ||
| 2. **Heartbeat Frequency**: Balance command responsiveness against backend load (default: 30s) | |
| 3. **Command Cleanup**: Implement periodic cleanup of old commands/executions | ||
| 4. **Connection Pooling**: Use connection pooling for database access | ||
|
|
||
| ## Future Enhancements | ||
|
|
||
| 1. **Agent Discovery**: Automatic service registration | ||
| 2. **Command Queuing**: Support for command queues per host | ||
| 3. **Conditional Commands**: Commands based on host metrics or state | ||
| 4. **Command Templates**: Predefined command templates for common scenarios | ||
| 5. **Real-time Dashboard**: Web UI for monitoring active agents and commands | ||
This JOIN condition is inconsistent with the foreign key definition in the schema. The schema's foreign key constraint references `ProfilingRequests(request_id)`, but this query joins on `pr.ID`. The query will fail because `profiling_request_id` contains UUID values from the `request_id` column, not bigint IDs. The correct join is: `JOIN ProfilingRequests pr ON pe.profiling_request_id = pr.request_id`